This is the second workshop designed for the Cancer MSc students at the UCL Cancer Institute to gain some confidence in using the R (statistical) programming language in their MSc projects. I would appreciate it if you could take part in this pre-course survey (once again) so that I know what you expect from today’s workshop.

Preface

After failing my practical driving test twice, I took some time off from driving lessons. The reason was, partly, financial. It was also the beginning of the shorter days of winter and I was a bit worried about taking the exam during those cold days when the roads in Edinburgh become a bit tricky to drive on. The next summer, when I tried to contact my driving instructor again, to my surprise, I learned that he had changed his profession. Well, my wife still secretly blames me, as it was not the first time that one of my driving instructors had stopped teaching or changed career (though my first instructor took a break due to some family responsibilities).

So, I went to a third driving instructor, Bill. He was in his late 70s, I guess, and initially I had a hard time understanding what he was saying. You may think that’s not ideal for a driving lesson. But interestingly enough, it worked out in the end and I got my driving licence this time. Anyway, Bill used to be an engineer and, after his retirement, he started a second career as a driving instructor. On the first day, Bill told me to forget everything I had learned so far about driving. I was a bit shocked by his condescending approach, but when he started the lesson it felt like he was teaching me the grammar of driving - how to control the clutch, how to read the mind of the driver of an oncoming vehicle, etc.

By now, you may have started to wonder what Bill has to do with you or this workshop. Bear with me. Each time I think about two R packages, namely dplyr and ggplot2, they remind me of Bill. In this workshop on exploratory data analysis using R, we will learn the grammar of data manipulation and the grammar of graphics to draw fancy plots. What you learned in the previous workshop was not even the tip of the iceberg of plotting with R. The functions we used were a bit rigid and gave you less control over your plots. But here, with the ggplot2 package, we will shape the plots as we wish (at least, to a greater degree). We will draw layer upon layer to incorporate many aspects of the data in a single plot. For data handling, we will use the dplyr package. We will add layers of functions, as we progress, to build our data structure for downstream analysis. And as a whole, we will try to tell a story with our plots during the workshop.

1 Introduction

This is the second and last workshop in this series for this year. In the first workshop, I introduced very basic functions of R for data handling and generating basic plots. Some of you wondered about the utility of R in your biological research in the coming months. That might be due to my choice of very simple datasets that came with base R. I was consciously avoiding more complex real-life datasets (especially those related to molecular biology or omics) so that you (at least ~70% of you) would not be startled by them while encountering R syntax for the first time.

In the first half of today’s workshop, we will learn a more efficient way of handling / manipulating data using an R package called dplyr and generate plots using another package called ggplot2. However, we will still be using built-in data from base R. Don’t be disheartened; soon we will shift our attention to real-life / clinical data. We will be using a few datasets that were part of a study called METABRIC (Molecular Taxonomy of Breast Cancer International Consortium). These datasets characterise the genomic mutations (SNVs and CNAs) and gene expression profiles from over 2000 primary breast tumours. In addition, detailed clinical information for this study is available from cBioPortal alongside the experimental data, and we will integrate the two. You can follow the little download sign on that page or you can click here to download the dataset. Save the brca_metabric.tar.gz file somewhere on your computer and decompress it. We will import some of the files from there.

In this workshop, we are not planning to do any major data analysis; rather, we will stick to the realm of (the fancy name) exploratory data analysis (EDA) by formatting data and drawing some informative plots. We will learn a few important functions (or verbs) to perform data manipulation. We will find out which mutation type is the most prominent. We will also generate a word cloud using the most affected genes in the patient cohort.

We will see the expression of the GATA3 transcription factor in PAM50-classified samples or samples with different ER status. We will also see the age distribution of the patients for some selected mutated genes. Lastly, we will explore the concept of co-occurrence of mutations among some cancer-related genes in the METABRIC cohort.

2 Data manipulation using dplyr

Trust me, this is the part of my research where I spend a significant portion of my time. Real-life data are not polished and nicely annotated. Moreover, when you want to integrate data from different sources, the fun begins (I am making air quotes, of course)! You also need to format the output from one process to make it suitable for the next one. So, there’s no escape from formatting / manipulating data in real life.

Here, we will be using the dplyr package, which is one of the most powerful and popular packages in R. The d stands for data and plyr is supposed to evoke the tool pliers. The name dplyr therefore refers to a tool for manipulating data (frames). dplyr provides a grammar of data manipulation: its functions are regarded as the verbs of the code, and they solve the most common data manipulation problems very efficiently. It is arguably more efficient than the equivalent base R operations in some cases.

2.1 Install

There are two main ways to install the dplyr package in R. You can install the tidyverse package, and dplyr, being a part of it, will automatically be installed in your R environment.

install.packages("tidyverse")

Or you can install just the dplyr package with -

install.packages("dplyr")

However, if you want to install the development version, which I wouldn’t recommend at this stage, you can use the code below -

# install devtools first if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
# the dplyr repository now lives under the tidyverse organisation on GitHub
devtools::install_github("tidyverse/dplyr")

And, now load it …

library(dplyr)

2.2 Pipe operator %>%

It would be a crime not to introduce the pipe operator %>% to you before starting with the dplyr verbs. If you are familiar with the pipe operator | in bash scripting, that’s it. I have no better way to describe it to you. But if you are not, then here is the thing for you -

The pipe operator %>% connects two operations on the same data (be it a vector or a data-frame). It passes the output of the operation on its left-hand side as the first argument to the operation on its right-hand side. If you want a formal definition: x %>% f(y) is evaluated as f(x, y).

Let’s look at an example. Say we have a vector x that holds the values from 1 to 100, and we want to calculate the mean of x and round it to an integer. In base R, we write -

x <- 1:100
round(mean(x))
## [1] 50

On the other hand, using the pipe operator, we can first define x, then calculate the mean and, at the end, round it to an integer, like -

x <- 1:100
x %>% mean %>% round
## [1] 50

It goes from left to right, just as we think when we build our data analysis pipeline. Since version 4.1.0, base R also provides a native pipe operator, |>, but I will stick to %>% in the workshop.
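
To make the formal definition concrete, here is a minimal sketch showing that a piped call with an extra argument gives the same result as the nested call. The vector y and the choice of round() with digits = 1 are illustrative assumptions, not part of the workshop data.

```r
library(dplyr)  # attaching dplyr makes the %>% pipe available

# an illustrative vector (not from the workshop data)
y <- c(1.234, 5.678, 9.1011)

# piped form: y is passed as the first argument of round()
piped <- y %>% round(digits = 1)

# equivalent nested form
nested <- round(y, digits = 1)

# both forms should give exactly the same result
identical(piped, nested)
## [1] TRUE
```

Any further arguments written inside the parentheses on the right-hand side simply follow the piped value, which is why the verbs later in this workshop never need the data-frame typed as their first argument.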

2.3 dplyr verbs

There are many verbs embedded in the dplyr package. Here I will be discussing a few (but very important ones) that you will need to resolve most of the data manipulation challenges in your day-to-day life.

2.3.1 select()

select() picks variables based on their names or types. For example -

# using specific variable names -
iris %>% 
  select(Sepal.Length, Sepal.Width) 
iris data: Sepal length and width
Sepal.Length Sepal.Width
5.1 3.5
4.9 3.0
4.7 3.2
4.6 3.1
5.0 3.6
5.4 3.9
4.6 3.4
5.0 3.4
4.4 2.9
4.9 3.1
5.4 3.7
4.8 3.4
4.8 3.0
4.3 3.0
5.8 4.0
5.7 4.4
5.4 3.9
5.1 3.5
5.7 3.8
5.1 3.8
5.4 3.4
5.1 3.7
4.6 3.6
5.1 3.3
4.8 3.4
5.0 3.0
5.0 3.4
5.2 3.5
5.2 3.4
4.7 3.2
4.8 3.1
5.4 3.4
5.2 4.1
5.5 4.2
4.9 3.1
5.0 3.2
5.5 3.5
4.9 3.6
4.4 3.0
5.1 3.4
5.0 3.5
4.5 2.3
4.4 3.2
5.0 3.5
5.1 3.8
4.8 3.0
5.1 3.8
4.6 3.2
5.3 3.7
5.0 3.3
7.0 3.2
6.4 3.2
6.9 3.1
5.5 2.3
6.5 2.8
5.7 2.8
6.3 3.3
4.9 2.4
6.6 2.9
5.2 2.7
5.0 2.0
5.9 3.0
6.0 2.2
6.1 2.9
5.6 2.9
6.7 3.1
5.6 3.0
5.8 2.7
6.2 2.2
5.6 2.5
5.9 3.2
6.1 2.8
6.3 2.5
6.1 2.8
6.4 2.9
6.6 3.0
6.8 2.8
6.7 3.0
6.0 2.9
5.7 2.6
5.5 2.4
5.5 2.4
5.8 2.7
6.0 2.7
5.4 3.0
6.0 3.4
6.7 3.1
6.3 2.3
5.6 3.0
5.5 2.5
5.5 2.6
6.1 3.0
5.8 2.6
5.0 2.3
5.6 2.7
5.7 3.0
5.7 2.9
6.2 2.9
5.1 2.5
5.7 2.8
6.3 3.3
5.8 2.7
7.1 3.0
6.3 2.9
6.5 3.0
7.6 3.0
4.9 2.5
7.3 2.9
6.7 2.5
7.2 3.6
6.5 3.2
6.4 2.7
6.8 3.0
5.7 2.5
5.8 2.8
6.4 3.2
6.5 3.0
7.7 3.8
7.7 2.6
6.0 2.2
6.9 3.2
5.6 2.8
7.7 2.8
6.3 2.7
6.7 3.3
7.2 3.2
6.2 2.8
6.1 3.0
6.4 2.8
7.2 3.0
7.4 2.8
7.9 3.8
6.4 2.8
6.3 2.8
6.1 2.6
7.7 3.0
6.3 3.4
6.4 3.1
6.0 3.0
6.9 3.1
6.7 3.1
6.9 3.1
5.8 2.7
6.8 3.2
6.7 3.3
6.7 3.0
6.3 2.5
6.5 3.0
6.2 3.4
5.9 3.0
# using type (a predicate function must be wrapped in where()) -
iris %>% 
  select(where(is.numeric))
iris data: numeric columns only
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5.0 3.6 1.4 0.2
5.4 3.9 1.7 0.4
4.6 3.4 1.4 0.3
5.0 3.4 1.5 0.2
4.4 2.9 1.4 0.2
4.9 3.1 1.5 0.1
5.4 3.7 1.5 0.2
4.8 3.4 1.6 0.2
4.8 3.0 1.4 0.1
4.3 3.0 1.1 0.1
5.8 4.0 1.2 0.2
5.7 4.4 1.5 0.4
5.4 3.9 1.3 0.4
5.1 3.5 1.4 0.3
5.7 3.8 1.7 0.3
5.1 3.8 1.5 0.3
5.4 3.4 1.7 0.2
5.1 3.7 1.5 0.4
4.6 3.6 1.0 0.2
5.1 3.3 1.7 0.5
4.8 3.4 1.9 0.2
5.0 3.0 1.6 0.2
5.0 3.4 1.6 0.4
5.2 3.5 1.5 0.2
5.2 3.4 1.4 0.2
4.7 3.2 1.6 0.2
4.8 3.1 1.6 0.2
5.4 3.4 1.5 0.4
5.2 4.1 1.5 0.1
5.5 4.2 1.4 0.2
4.9 3.1 1.5 0.2
5.0 3.2 1.2 0.2
5.5 3.5 1.3 0.2
4.9 3.6 1.4 0.1
4.4 3.0 1.3 0.2
5.1 3.4 1.5 0.2
5.0 3.5 1.3 0.3
4.5 2.3 1.3 0.3
4.4 3.2 1.3 0.2
5.0 3.5 1.6 0.6
5.1 3.8 1.9 0.4
4.8 3.0 1.4 0.3
5.1 3.8 1.6 0.2
4.6 3.2 1.4 0.2
5.3 3.7 1.5 0.2
5.0 3.3 1.4 0.2
7.0 3.2 4.7 1.4
6.4 3.2 4.5 1.5
6.9 3.1 4.9 1.5
5.5 2.3 4.0 1.3
6.5 2.8 4.6 1.5
5.7 2.8 4.5 1.3
6.3 3.3 4.7 1.6
4.9 2.4 3.3 1.0
6.6 2.9 4.6 1.3
5.2 2.7 3.9 1.4
5.0 2.0 3.5 1.0
5.9 3.0 4.2 1.5
6.0 2.2 4.0 1.0
6.1 2.9 4.7 1.4
5.6 2.9 3.6 1.3
6.7 3.1 4.4 1.4
5.6 3.0 4.5 1.5
5.8 2.7 4.1 1.0
6.2 2.2 4.5 1.5
5.6 2.5 3.9 1.1
5.9 3.2 4.8 1.8
6.1 2.8 4.0 1.3
6.3 2.5 4.9 1.5
6.1 2.8 4.7 1.2
6.4 2.9 4.3 1.3
6.6 3.0 4.4 1.4
6.8 2.8 4.8 1.4
6.7 3.0 5.0 1.7
6.0 2.9 4.5 1.5
5.7 2.6 3.5 1.0
5.5 2.4 3.8 1.1
5.5 2.4 3.7 1.0
5.8 2.7 3.9 1.2
6.0 2.7 5.1 1.6
5.4 3.0 4.5 1.5
6.0 3.4 4.5 1.6
6.7 3.1 4.7 1.5
6.3 2.3 4.4 1.3
5.6 3.0 4.1 1.3
5.5 2.5 4.0 1.3
5.5 2.6 4.4 1.2
6.1 3.0 4.6 1.4
5.8 2.6 4.0 1.2
5.0 2.3 3.3 1.0
5.6 2.7 4.2 1.3
5.7 3.0 4.2 1.2
5.7 2.9 4.2 1.3
6.2 2.9 4.3 1.3
5.1 2.5 3.0 1.1
5.7 2.8 4.1 1.3
6.3 3.3 6.0 2.5
5.8 2.7 5.1 1.9
7.1 3.0 5.9 2.1
6.3 2.9 5.6 1.8
6.5 3.0 5.8 2.2
7.6 3.0 6.6 2.1
4.9 2.5 4.5 1.7
7.3 2.9 6.3 1.8
6.7 2.5 5.8 1.8
7.2 3.6 6.1 2.5
6.5 3.2 5.1 2.0
6.4 2.7 5.3 1.9
6.8 3.0 5.5 2.1
5.7 2.5 5.0 2.0
5.8 2.8 5.1 2.4
6.4 3.2 5.3 2.3
6.5 3.0 5.5 1.8
7.7 3.8 6.7 2.2
7.7 2.6 6.9 2.3
6.0 2.2 5.0 1.5
6.9 3.2 5.7 2.3
5.6 2.8 4.9 2.0
7.7 2.8 6.7 2.0
6.3 2.7 4.9 1.8
6.7 3.3 5.7 2.1
7.2 3.2 6.0 1.8
6.2 2.8 4.8 1.8
6.1 3.0 4.9 1.8
6.4 2.8 5.6 2.1
7.2 3.0 5.8 1.6
7.4 2.8 6.1 1.9
7.9 3.8 6.4 2.0
6.4 2.8 5.6 2.2
6.3 2.8 5.1 1.5
6.1 2.6 5.6 1.4
7.7 3.0 6.1 2.3
6.3 3.4 5.6 2.4
6.4 3.1 5.5 1.8
6.0 3.0 4.8 1.8
6.9 3.1 5.4 2.1
6.7 3.1 5.6 2.4
6.9 3.1 5.1 2.3
5.8 2.7 5.1 1.9
6.8 3.2 5.9 2.3
6.7 3.3 5.7 2.5
6.7 3.0 5.2 2.3
6.3 2.5 5.0 1.9
6.5 3.0 5.2 2.0
6.2 3.4 5.4 2.3
5.9 3.0 5.1 1.8


The verb select() comes with some selection helpers -

If you want to select all the variables, you can use everything()

iris %>% 
  select(everything())
iris data: everything
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3.0 1.4 0.1 setosa
4.3 3.0 1.1 0.1 setosa
5.8 4.0 1.2 0.2 setosa
5.7 4.4 1.5 0.4 setosa
5.4 3.9 1.3 0.4 setosa
5.1 3.5 1.4 0.3 setosa
5.7 3.8 1.7 0.3 setosa
5.1 3.8 1.5 0.3 setosa
5.4 3.4 1.7 0.2 setosa
5.1 3.7 1.5 0.4 setosa
4.6 3.6 1.0 0.2 setosa
5.1 3.3 1.7 0.5 setosa
4.8 3.4 1.9 0.2 setosa
5.0 3.0 1.6 0.2 setosa
5.0 3.4 1.6 0.4 setosa
5.2 3.5 1.5 0.2 setosa
5.2 3.4 1.4 0.2 setosa
4.7 3.2 1.6 0.2 setosa
4.8 3.1 1.6 0.2 setosa
5.4 3.4 1.5 0.4 setosa
5.2 4.1 1.5 0.1 setosa
5.5 4.2 1.4 0.2 setosa
4.9 3.1 1.5 0.2 setosa
5.0 3.2 1.2 0.2 setosa
5.5 3.5 1.3 0.2 setosa
4.9 3.6 1.4 0.1 setosa
4.4 3.0 1.3 0.2 setosa
5.1 3.4 1.5 0.2 setosa
5.0 3.5 1.3 0.3 setosa
4.5 2.3 1.3 0.3 setosa
4.4 3.2 1.3 0.2 setosa
5.0 3.5 1.6 0.6 setosa
5.1 3.8 1.9 0.4 setosa
4.8 3.0 1.4 0.3 setosa
5.1 3.8 1.6 0.2 setosa
4.6 3.2 1.4 0.2 setosa
5.3 3.7 1.5 0.2 setosa
5.0 3.3 1.4 0.2 setosa
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
6.9 3.1 4.9 1.5 versicolor
5.5 2.3 4.0 1.3 versicolor
6.5 2.8 4.6 1.5 versicolor
5.7 2.8 4.5 1.3 versicolor
6.3 3.3 4.7 1.6 versicolor
4.9 2.4 3.3 1.0 versicolor
6.6 2.9 4.6 1.3 versicolor
5.2 2.7 3.9 1.4 versicolor
5.0 2.0 3.5 1.0 versicolor
5.9 3.0 4.2 1.5 versicolor
6.0 2.2 4.0 1.0 versicolor
6.1 2.9 4.7 1.4 versicolor
5.6 2.9 3.6 1.3 versicolor
6.7 3.1 4.4 1.4 versicolor
5.6 3.0 4.5 1.5 versicolor
5.8 2.7 4.1 1.0 versicolor
6.2 2.2 4.5 1.5 versicolor
5.6 2.5 3.9 1.1 versicolor
5.9 3.2 4.8 1.8 versicolor
6.1 2.8 4.0 1.3 versicolor
6.3 2.5 4.9 1.5 versicolor
6.1 2.8 4.7 1.2 versicolor
6.4 2.9 4.3 1.3 versicolor
6.6 3.0 4.4 1.4 versicolor
6.8 2.8 4.8 1.4 versicolor
6.7 3.0 5.0 1.7 versicolor
6.0 2.9 4.5 1.5 versicolor
5.7 2.6 3.5 1.0 versicolor
5.5 2.4 3.8 1.1 versicolor
5.5 2.4 3.7 1.0 versicolor
5.8 2.7 3.9 1.2 versicolor
6.0 2.7 5.1 1.6 versicolor
5.4 3.0 4.5 1.5 versicolor
6.0 3.4 4.5 1.6 versicolor
6.7 3.1 4.7 1.5 versicolor
6.3 2.3 4.4 1.3 versicolor
5.6 3.0 4.1 1.3 versicolor
5.5 2.5 4.0 1.3 versicolor
5.5 2.6 4.4 1.2 versicolor
6.1 3.0 4.6 1.4 versicolor
5.8 2.6 4.0 1.2 versicolor
5.0 2.3 3.3 1.0 versicolor
5.6 2.7 4.2 1.3 versicolor
5.7 3.0 4.2 1.2 versicolor
5.7 2.9 4.2 1.3 versicolor
6.2 2.9 4.3 1.3 versicolor
5.1 2.5 3.0 1.1 versicolor
5.7 2.8 4.1 1.3 versicolor
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
7.1 3.0 5.9 2.1 virginica
6.3 2.9 5.6 1.8 virginica
6.5 3.0 5.8 2.2 virginica
7.6 3.0 6.6 2.1 virginica
4.9 2.5 4.5 1.7 virginica
7.3 2.9 6.3 1.8 virginica
6.7 2.5 5.8 1.8 virginica
7.2 3.6 6.1 2.5 virginica
6.5 3.2 5.1 2.0 virginica
6.4 2.7 5.3 1.9 virginica
6.8 3.0 5.5 2.1 virginica
5.7 2.5 5.0 2.0 virginica
5.8 2.8 5.1 2.4 virginica
6.4 3.2 5.3 2.3 virginica
6.5 3.0 5.5 1.8 virginica
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
6.0 2.2 5.0 1.5 virginica
6.9 3.2 5.7 2.3 virginica
5.6 2.8 4.9 2.0 virginica
7.7 2.8 6.7 2.0 virginica
6.3 2.7 4.9 1.8 virginica
6.7 3.3 5.7 2.1 virginica
7.2 3.2 6.0 1.8 virginica
6.2 2.8 4.8 1.8 virginica
6.1 3.0 4.9 1.8 virginica
6.4 2.8 5.6 2.1 virginica
7.2 3.0 5.8 1.6 virginica
7.4 2.8 6.1 1.9 virginica
7.9 3.8 6.4 2.0 virginica
6.4 2.8 5.6 2.2 virginica
6.3 2.8 5.1 1.5 virginica
6.1 2.6 5.6 1.4 virginica
7.7 3.0 6.1 2.3 virginica
6.3 3.4 5.6 2.4 virginica
6.4 3.1 5.5 1.8 virginica
6.0 3.0 4.8 1.8 virginica
6.9 3.1 5.4 2.1 virginica
6.7 3.1 5.6 2.4 virginica
6.9 3.1 5.1 2.3 virginica
5.8 2.7 5.1 1.9 virginica
6.8 3.2 5.9 2.3 virginica
6.7 3.3 5.7 2.5 virginica
6.7 3.0 5.2 2.3 virginica
6.3 2.5 5.0 1.9 virginica
6.5 3.0 5.2 2.0 virginica
6.2 3.4 5.4 2.3 virginica
5.9 3.0 5.1 1.8 virginica


You can choose the last column using last_col() or only columns that are grouped using group_cols() (You will understand better when I discuss the group_by() verb later).

# select the last column
iris %>% 
  select(last_col())
iris data: last_col()
Species
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
# select the grouped column(s)
iris %>% 
  group_by(Sepal.Length,Sepal.Width) %>% 
  select(group_cols())
iris data: select grouped columns
Sepal.Length Sepal.Width
5.1 3.5
4.9 3.0
4.7 3.2
4.6 3.1
5.0 3.6
5.4 3.9
4.6 3.4
5.0 3.4
4.4 2.9
4.9 3.1
5.4 3.7
4.8 3.4
4.8 3.0
4.3 3.0
5.8 4.0
5.7 4.4
5.4 3.9
5.1 3.5
5.7 3.8
5.1 3.8
5.4 3.4
5.1 3.7
4.6 3.6
5.1 3.3
4.8 3.4
5.0 3.0
5.0 3.4
5.2 3.5
5.2 3.4
4.7 3.2
4.8 3.1
5.4 3.4
5.2 4.1
5.5 4.2
4.9 3.1
5.0 3.2
5.5 3.5
4.9 3.6
4.4 3.0
5.1 3.4
5.0 3.5
4.5 2.3
4.4 3.2
5.0 3.5
5.1 3.8
4.8 3.0
5.1 3.8
4.6 3.2
5.3 3.7
5.0 3.3
7.0 3.2
6.4 3.2
6.9 3.1
5.5 2.3
6.5 2.8
5.7 2.8
6.3 3.3
4.9 2.4
6.6 2.9
5.2 2.7
5.0 2.0
5.9 3.0
6.0 2.2
6.1 2.9
5.6 2.9
6.7 3.1
5.6 3.0
5.8 2.7
6.2 2.2
5.6 2.5
5.9 3.2
6.1 2.8
6.3 2.5
6.1 2.8
6.4 2.9
6.6 3.0
6.8 2.8
6.7 3.0
6.0 2.9
5.7 2.6
5.5 2.4
5.5 2.4
5.8 2.7
6.0 2.7
5.4 3.0
6.0 3.4
6.7 3.1
6.3 2.3
5.6 3.0
5.5 2.5
5.5 2.6
6.1 3.0
5.8 2.6
5.0 2.3
5.6 2.7
5.7 3.0
5.7 2.9
6.2 2.9
5.1 2.5
5.7 2.8
6.3 3.3
5.8 2.7
7.1 3.0
6.3 2.9
6.5 3.0
7.6 3.0
4.9 2.5
7.3 2.9
6.7 2.5
7.2 3.6
6.5 3.2
6.4 2.7
6.8 3.0
5.7 2.5
5.8 2.8
6.4 3.2
6.5 3.0
7.7 3.8
7.7 2.6
6.0 2.2
6.9 3.2
5.6 2.8
7.7 2.8
6.3 2.7
6.7 3.3
7.2 3.2
6.2 2.8
6.1 3.0
6.4 2.8
7.2 3.0
7.4 2.8
7.9 3.8
6.4 2.8
6.3 2.8
6.1 2.6
7.7 3.0
6.3 3.4
6.4 3.1
6.0 3.0
6.9 3.1
6.7 3.1
6.9 3.1
5.8 2.7
6.8 3.2
6.7 3.3
6.7 3.0
6.3 2.5
6.5 3.0
6.2 3.4
5.9 3.0


If there’s a common prefix or suffix to some column names, you can utilise that by using selection helpers starts_with() or ends_with(), respectively -

# starts_with()
iris %>% 
  select(starts_with("Sepal"))
iris data: columns starting with Sepal
Sepal.Length Sepal.Width
5.1 3.5
4.9 3.0
4.7 3.2
4.6 3.1
5.0 3.6
5.4 3.9
4.6 3.4
5.0 3.4
4.4 2.9
4.9 3.1
5.4 3.7
4.8 3.4
4.8 3.0
4.3 3.0
5.8 4.0
5.7 4.4
5.4 3.9
5.1 3.5
5.7 3.8
5.1 3.8
5.4 3.4
5.1 3.7
4.6 3.6
5.1 3.3
4.8 3.4
5.0 3.0
5.0 3.4
5.2 3.5
5.2 3.4
4.7 3.2
4.8 3.1
5.4 3.4
5.2 4.1
5.5 4.2
4.9 3.1
5.0 3.2
5.5 3.5
4.9 3.6
4.4 3.0
5.1 3.4
5.0 3.5
4.5 2.3
4.4 3.2
5.0 3.5
5.1 3.8
4.8 3.0
5.1 3.8
4.6 3.2
5.3 3.7
5.0 3.3
7.0 3.2
6.4 3.2
6.9 3.1
5.5 2.3
6.5 2.8
5.7 2.8
6.3 3.3
4.9 2.4
6.6 2.9
5.2 2.7
5.0 2.0
5.9 3.0
6.0 2.2
6.1 2.9
5.6 2.9
6.7 3.1
5.6 3.0
5.8 2.7
6.2 2.2
5.6 2.5
5.9 3.2
6.1 2.8
6.3 2.5
6.1 2.8
6.4 2.9
6.6 3.0
6.8 2.8
6.7 3.0
6.0 2.9
5.7 2.6
5.5 2.4
5.5 2.4
5.8 2.7
6.0 2.7
5.4 3.0
6.0 3.4
6.7 3.1
6.3 2.3
5.6 3.0
5.5 2.5
5.5 2.6
6.1 3.0
5.8 2.6
5.0 2.3
5.6 2.7
5.7 3.0
5.7 2.9
6.2 2.9
5.1 2.5
5.7 2.8
6.3 3.3
5.8 2.7
7.1 3.0
6.3 2.9
6.5 3.0
7.6 3.0
4.9 2.5
7.3 2.9
6.7 2.5
7.2 3.6
6.5 3.2
6.4 2.7
6.8 3.0
5.7 2.5
5.8 2.8
6.4 3.2
6.5 3.0
7.7 3.8
7.7 2.6
6.0 2.2
6.9 3.2
5.6 2.8
7.7 2.8
6.3 2.7
6.7 3.3
7.2 3.2
6.2 2.8
6.1 3.0
6.4 2.8
7.2 3.0
7.4 2.8
7.9 3.8
6.4 2.8
6.3 2.8
6.1 2.6
7.7 3.0
6.3 3.4
6.4 3.1
6.0 3.0
6.9 3.1
6.7 3.1
6.9 3.1
5.8 2.7
6.8 3.2
6.7 3.3
6.7 3.0
6.3 2.5
6.5 3.0
6.2 3.4
5.9 3.0
# ends_with()
iris %>% 
  select(ends_with("Length"))
iris data: columns ending with Length
Sepal.Length Petal.Length
5.1 1.4
4.9 1.4
4.7 1.3
4.6 1.5
5.0 1.4
5.4 1.7
4.6 1.4
5.0 1.5
4.4 1.4
4.9 1.5
5.4 1.5
4.8 1.6
4.8 1.4
4.3 1.1
5.8 1.2
5.7 1.5
5.4 1.3
5.1 1.4
5.7 1.7
5.1 1.5
5.4 1.7
5.1 1.5
4.6 1.0
5.1 1.7
4.8 1.9
5.0 1.6
5.0 1.6
5.2 1.5
5.2 1.4
4.7 1.6
4.8 1.6
5.4 1.5
5.2 1.5
5.5 1.4
4.9 1.5
5.0 1.2
5.5 1.3
4.9 1.4
4.4 1.3
5.1 1.5
5.0 1.3
4.5 1.3
4.4 1.3
5.0 1.6
5.1 1.9
4.8 1.4
5.1 1.6
4.6 1.4
5.3 1.5
5.0 1.4
7.0 4.7
6.4 4.5
6.9 4.9
5.5 4.0
6.5 4.6
5.7 4.5
6.3 4.7
4.9 3.3
6.6 4.6
5.2 3.9
5.0 3.5
5.9 4.2
6.0 4.0
6.1 4.7
5.6 3.6
6.7 4.4
5.6 4.5
5.8 4.1
6.2 4.5
5.6 3.9
5.9 4.8
6.1 4.0
6.3 4.9
6.1 4.7
6.4 4.3
6.6 4.4
6.8 4.8
6.7 5.0
6.0 4.5
5.7 3.5
5.5 3.8
5.5 3.7
5.8 3.9
6.0 5.1
5.4 4.5
6.0 4.5
6.7 4.7
6.3 4.4
5.6 4.1
5.5 4.0
5.5 4.4
6.1 4.6
5.8 4.0
5.0 3.3
5.6 4.2
5.7 4.2
5.7 4.2
6.2 4.3
5.1 3.0
5.7 4.1
6.3 6.0
5.8 5.1
7.1 5.9
6.3 5.6
6.5 5.8
7.6 6.6
4.9 4.5
7.3 6.3
6.7 5.8
7.2 6.1
6.5 5.1
6.4 5.3
6.8 5.5
5.7 5.0
5.8 5.1
6.4 5.3
6.5 5.5
7.7 6.7
7.7 6.9
6.0 5.0
6.9 5.7
5.6 4.9
7.7 6.7
6.3 4.9
6.7 5.7
7.2 6.0
6.2 4.8
6.1 4.9
6.4 5.6
7.2 5.8
7.4 6.1
7.9 6.4
6.4 5.6
6.3 5.1
6.1 5.6
7.7 6.1
6.3 5.6
6.4 5.5
6.0 4.8
6.9 5.4
6.7 5.6
6.9 5.1
5.8 5.1
6.8 5.9
6.7 5.7
6.7 5.2
6.3 5.0
6.5 5.2
6.2 5.4
5.9 5.1


An internal pattern can even be used to select columns by using contains() -

iris %>% 
  select(contains("dth"))
iris data: column names containing ‘dth’
Sepal.Width Petal.Width
3.5 0.2
3.0 0.2
3.2 0.2
3.1 0.2
3.6 0.2
3.9 0.4
3.4 0.3
3.4 0.2
2.9 0.2
3.1 0.1
3.7 0.2
3.4 0.2
3.0 0.1
3.0 0.1
4.0 0.2
4.4 0.4
3.9 0.4
3.5 0.3
3.8 0.3
3.8 0.3
3.4 0.2
3.7 0.4
3.6 0.2
3.3 0.5
3.4 0.2
3.0 0.2
3.4 0.4
3.5 0.2
3.4 0.2
3.2 0.2
3.1 0.2
3.4 0.4
4.1 0.1
4.2 0.2
3.1 0.2
3.2 0.2
3.5 0.2
3.6 0.1
3.0 0.2
3.4 0.2
3.5 0.3
2.3 0.3
3.2 0.2
3.5 0.6
3.8 0.4
3.0 0.3
3.8 0.2
3.2 0.2
3.7 0.2
3.3 0.2
3.2 1.4
3.2 1.5
3.1 1.5
2.3 1.3
2.8 1.5
2.8 1.3
3.3 1.6
2.4 1.0
2.9 1.3
2.7 1.4
2.0 1.0
3.0 1.5
2.2 1.0
2.9 1.4
2.9 1.3
3.1 1.4
3.0 1.5
2.7 1.0
2.2 1.5
2.5 1.1
3.2 1.8
2.8 1.3
2.5 1.5
2.8 1.2
2.9 1.3
3.0 1.4
2.8 1.4
3.0 1.7
2.9 1.5
2.6 1.0
2.4 1.1
2.4 1.0
2.7 1.2
2.7 1.6
3.0 1.5
3.4 1.6
3.1 1.5
2.3 1.3
3.0 1.3
2.5 1.3
2.6 1.2
3.0 1.4
2.6 1.2
2.3 1.0
2.7 1.3
3.0 1.2
2.9 1.3
2.9 1.3
2.5 1.1
2.8 1.3
3.3 2.5
2.7 1.9
3.0 2.1
2.9 1.8
3.0 2.2
3.0 2.1
2.5 1.7
2.9 1.8
2.5 1.8
3.6 2.5
3.2 2.0
2.7 1.9
3.0 2.1
2.5 2.0
2.8 2.4
3.2 2.3
3.0 1.8
3.8 2.2
2.6 2.3
2.2 1.5
3.2 2.3
2.8 2.0
2.8 2.0
2.7 1.8
3.3 2.1
3.2 1.8
2.8 1.8
3.0 1.8
2.8 2.1
3.0 1.6
2.8 1.9
3.8 2.0
2.8 2.2
2.8 1.5
2.6 1.4
3.0 2.3
3.4 2.4
3.1 1.8
3.0 1.8
3.1 2.1
3.1 2.4
3.1 2.3
2.7 1.9
3.2 2.3
3.3 2.5
3.0 2.3
2.5 1.9
3.0 2.0
3.4 2.3
3.0 1.8


You can even use a regular expression to select columns by using matches() -

# column names containing W or d (note that matches() ignores case by default)
iris %>% 
  select(matches("[Wd]"))
iris data: column name containing W or d
Sepal.Width Petal.Width
3.5 0.2
3.0 0.2
3.2 0.2
3.1 0.2
3.6 0.2
3.9 0.4
3.4 0.3
3.4 0.2
2.9 0.2
3.1 0.1
3.7 0.2
3.4 0.2
3.0 0.1
3.0 0.1
4.0 0.2
4.4 0.4
3.9 0.4
3.5 0.3
3.8 0.3
3.8 0.3
3.4 0.2
3.7 0.4
3.6 0.2
3.3 0.5
3.4 0.2
3.0 0.2
3.4 0.4
3.5 0.2
3.4 0.2
3.2 0.2
3.1 0.2
3.4 0.4
4.1 0.1
4.2 0.2
3.1 0.2
3.2 0.2
3.5 0.2
3.6 0.1
3.0 0.2
3.4 0.2
3.5 0.3
2.3 0.3
3.2 0.2
3.5 0.6
3.8 0.4
3.0 0.3
3.8 0.2
3.2 0.2
3.7 0.2
3.3 0.2
3.2 1.4
3.2 1.5
3.1 1.5
2.3 1.3
2.8 1.5
2.8 1.3
3.3 1.6
2.4 1.0
2.9 1.3
2.7 1.4
2.0 1.0
3.0 1.5
2.2 1.0
2.9 1.4
2.9 1.3
3.1 1.4
3.0 1.5
2.7 1.0
2.2 1.5
2.5 1.1
3.2 1.8
2.8 1.3
2.5 1.5
2.8 1.2
2.9 1.3
3.0 1.4
2.8 1.4
3.0 1.7
2.9 1.5
2.6 1.0
2.4 1.1
2.4 1.0
2.7 1.2
2.7 1.6
3.0 1.5
3.4 1.6
3.1 1.5
2.3 1.3
3.0 1.3
2.5 1.3
2.6 1.2
3.0 1.4
2.6 1.2
2.3 1.0
2.7 1.3
3.0 1.2
2.9 1.3
2.9 1.3
2.5 1.1
2.8 1.3
3.3 2.5
2.7 1.9
3.0 2.1
2.9 1.8
3.0 2.2
3.0 2.1
2.5 1.7
2.9 1.8
2.5 1.8
3.6 2.5
3.2 2.0
2.7 1.9
3.0 2.1
2.5 2.0
2.8 2.4
3.2 2.3
3.0 1.8
3.8 2.2
2.6 2.3
2.2 1.5
3.2 2.3
2.8 2.0
2.8 2.0
2.7 1.8
3.3 2.1
3.2 1.8
2.8 1.8
3.0 1.8
2.8 2.1
3.0 1.6
2.8 1.9
3.8 2.0
2.8 2.2
2.8 1.5
2.6 1.4
3.0 2.3
3.4 2.4
3.1 1.8
3.0 1.8
3.1 2.1
3.1 2.4
3.1 2.3
2.7 1.9
3.2 2.3
3.3 2.5
3.0 2.3
2.5 1.9
3.0 2.0
3.4 2.3
3.0 1.8
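
Printing all 150 rows is not necessary to check what a selection helper picks up. As a quick sketch, each selection can be piped into names() to list only the column names that survive. The object names below are made up for illustration, and the where() call assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# pipe each selection into names() to see only the selected column names
sepal_cols   <- iris %>% select(starts_with("Sepal")) %>% names()
length_cols  <- iris %>% select(ends_with("Length")) %>% names()
width_cols   <- iris %>% select(contains("dth")) %>% names()
numeric_cols <- iris %>% select(where(is.numeric)) %>% names()
last_column  <- iris %>% select(last_col()) %>% names()

sepal_cols    # "Sepal.Length" "Sepal.Width"
length_cols   # "Sepal.Length" "Petal.Length"
width_cols    # "Sepal.Width" "Petal.Width"
numeric_cols  # the four measurement columns, without Species
last_column   # "Species"
```

This is a handy habit with wide real-life tables, where checking the selected names is much quicker than scrolling through the data.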

2.3.2 filter()

The filter() verb is used to subset a data-frame based on one or more conditions imposed on its rows. Only the rows whose values satisfy the condition(s) are kept; all other rows are filtered out entirely. There are some functions and operators that you should know when working with the filter() verb, like -

==, !=, >, <, >=, <=
&, |, !
is.na()
%in%

Let’s see some examples -

# choose the rows whose Petal.Width is greater than 2
iris %>% 
  filter(Petal.Width > 2)
iris data: Petal width greater than 2
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6.3 3.3 6.0 2.5 virginica
7.1 3.0 5.9 2.1 virginica
6.5 3.0 5.8 2.2 virginica
7.6 3.0 6.6 2.1 virginica
7.2 3.6 6.1 2.5 virginica
6.8 3.0 5.5 2.1 virginica
5.8 2.8 5.1 2.4 virginica
6.4 3.2 5.3 2.3 virginica
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
6.9 3.2 5.7 2.3 virginica
6.7 3.3 5.7 2.1 virginica
6.4 2.8 5.6 2.1 virginica
6.4 2.8 5.6 2.2 virginica
7.7 3.0 6.1 2.3 virginica
6.3 3.4 5.6 2.4 virginica
6.9 3.1 5.4 2.1 virginica
6.7 3.1 5.6 2.4 virginica
6.9 3.1 5.1 2.3 virginica
6.8 3.2 5.9 2.3 virginica
6.7 3.3 5.7 2.5 virginica
6.7 3.0 5.2 2.3 virginica
6.2 3.4 5.4 2.3 virginica
# choose the rows for setosa Species
iris %>% 
  filter(Species == "setosa")
  # filter(Species %in% "setosa")
iris data: setosa only
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3.0 1.4 0.1 setosa
4.3 3.0 1.1 0.1 setosa
5.8 4.0 1.2 0.2 setosa
5.7 4.4 1.5 0.4 setosa
5.4 3.9 1.3 0.4 setosa
5.1 3.5 1.4 0.3 setosa
5.7 3.8 1.7 0.3 setosa
5.1 3.8 1.5 0.3 setosa
5.4 3.4 1.7 0.2 setosa
5.1 3.7 1.5 0.4 setosa
4.6 3.6 1.0 0.2 setosa
5.1 3.3 1.7 0.5 setosa
4.8 3.4 1.9 0.2 setosa
5.0 3.0 1.6 0.2 setosa
5.0 3.4 1.6 0.4 setosa
5.2 3.5 1.5 0.2 setosa
5.2 3.4 1.4 0.2 setosa
4.7 3.2 1.6 0.2 setosa
4.8 3.1 1.6 0.2 setosa
5.4 3.4 1.5 0.4 setosa
5.2 4.1 1.5 0.1 setosa
5.5 4.2 1.4 0.2 setosa
4.9 3.1 1.5 0.2 setosa
5.0 3.2 1.2 0.2 setosa
5.5 3.5 1.3 0.2 setosa
4.9 3.6 1.4 0.1 setosa
4.4 3.0 1.3 0.2 setosa
5.1 3.4 1.5 0.2 setosa
5.0 3.5 1.3 0.3 setosa
4.5 2.3 1.3 0.3 setosa
4.4 3.2 1.3 0.2 setosa
5.0 3.5 1.6 0.6 setosa
5.1 3.8 1.9 0.4 setosa
4.8 3.0 1.4 0.3 setosa
5.1 3.8 1.6 0.2 setosa
4.6 3.2 1.4 0.2 setosa
5.3 3.7 1.5 0.2 setosa
5.0 3.3 1.4 0.2 setosa
# or the opposite: keep every species except setosa
iris %>% filter(Species != "setosa")
iris data: without setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
6.9 3.1 4.9 1.5 versicolor
5.5 2.3 4.0 1.3 versicolor
6.5 2.8 4.6 1.5 versicolor
5.7 2.8 4.5 1.3 versicolor
6.3 3.3 4.7 1.6 versicolor
4.9 2.4 3.3 1.0 versicolor
6.6 2.9 4.6 1.3 versicolor
5.2 2.7 3.9 1.4 versicolor
5.0 2.0 3.5 1.0 versicolor
5.9 3.0 4.2 1.5 versicolor
6.0 2.2 4.0 1.0 versicolor
6.1 2.9 4.7 1.4 versicolor
5.6 2.9 3.6 1.3 versicolor
6.7 3.1 4.4 1.4 versicolor
5.6 3.0 4.5 1.5 versicolor
5.8 2.7 4.1 1.0 versicolor
6.2 2.2 4.5 1.5 versicolor
5.6 2.5 3.9 1.1 versicolor
5.9 3.2 4.8 1.8 versicolor
6.1 2.8 4.0 1.3 versicolor
6.3 2.5 4.9 1.5 versicolor
6.1 2.8 4.7 1.2 versicolor
6.4 2.9 4.3 1.3 versicolor
6.6 3.0 4.4 1.4 versicolor
6.8 2.8 4.8 1.4 versicolor
6.7 3.0 5.0 1.7 versicolor
6.0 2.9 4.5 1.5 versicolor
5.7 2.6 3.5 1.0 versicolor
5.5 2.4 3.8 1.1 versicolor
5.5 2.4 3.7 1.0 versicolor
5.8 2.7 3.9 1.2 versicolor
6.0 2.7 5.1 1.6 versicolor
5.4 3.0 4.5 1.5 versicolor
6.0 3.4 4.5 1.6 versicolor
6.7 3.1 4.7 1.5 versicolor
6.3 2.3 4.4 1.3 versicolor
5.6 3.0 4.1 1.3 versicolor
5.5 2.5 4.0 1.3 versicolor
5.5 2.6 4.4 1.2 versicolor
6.1 3.0 4.6 1.4 versicolor
5.8 2.6 4.0 1.2 versicolor
5.0 2.3 3.3 1.0 versicolor
5.6 2.7 4.2 1.3 versicolor
5.7 3.0 4.2 1.2 versicolor
5.7 2.9 4.2 1.3 versicolor
6.2 2.9 4.3 1.3 versicolor
5.1 2.5 3.0 1.1 versicolor
5.7 2.8 4.1 1.3 versicolor
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
7.1 3.0 5.9 2.1 virginica
6.3 2.9 5.6 1.8 virginica
6.5 3.0 5.8 2.2 virginica
7.6 3.0 6.6 2.1 virginica
4.9 2.5 4.5 1.7 virginica
7.3 2.9 6.3 1.8 virginica
6.7 2.5 5.8 1.8 virginica
7.2 3.6 6.1 2.5 virginica
6.5 3.2 5.1 2.0 virginica
6.4 2.7 5.3 1.9 virginica
6.8 3.0 5.5 2.1 virginica
5.7 2.5 5.0 2.0 virginica
5.8 2.8 5.1 2.4 virginica
6.4 3.2 5.3 2.3 virginica
6.5 3.0 5.5 1.8 virginica
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
6.0 2.2 5.0 1.5 virginica
6.9 3.2 5.7 2.3 virginica
5.6 2.8 4.9 2.0 virginica
7.7 2.8 6.7 2.0 virginica
6.3 2.7 4.9 1.8 virginica
6.7 3.3 5.7 2.1 virginica
7.2 3.2 6.0 1.8 virginica
6.2 2.8 4.8 1.8 virginica
6.1 3.0 4.9 1.8 virginica
6.4 2.8 5.6 2.1 virginica
7.2 3.0 5.8 1.6 virginica
7.4 2.8 6.1 1.9 virginica
7.9 3.8 6.4 2.0 virginica
6.4 2.8 5.6 2.2 virginica
6.3 2.8 5.1 1.5 virginica
6.1 2.6 5.6 1.4 virginica
7.7 3.0 6.1 2.3 virginica
6.3 3.4 5.6 2.4 virginica
6.4 3.1 5.5 1.8 virginica
6.0 3.0 4.8 1.8 virginica
6.9 3.1 5.4 2.1 virginica
6.7 3.1 5.6 2.4 virginica
6.9 3.1 5.1 2.3 virginica
5.8 2.7 5.1 1.9 virginica
6.8 3.2 5.9 2.3 virginica
6.7 3.3 5.7 2.5 virginica
6.7 3.0 5.2 2.3 virginica
6.3 2.5 5.0 1.9 virginica
6.5 3.0 5.2 2.0 virginica
6.2 3.4 5.4 2.3 virginica
5.9 3.0 5.1 1.8 virginica
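
The operators listed at the start of this section can also be combined inside a single filter() call. The sketch below is illustrative only: the object names and thresholds are assumptions, and nrow() is used just to count the surviving rows rather than print them.

```r
library(dplyr)

# %in% keeps rows whose Species matches any value in the vector
two_species <- iris %>%
  filter(Species %in% c("setosa", "versicolor"))

# & requires both conditions to hold for a row to be kept
wide_virginica <- iris %>%
  filter(Species == "virginica" & Petal.Width > 2)

# | keeps a row if at least one of the conditions holds
extreme_sepal <- iris %>%
  filter(Sepal.Length < 4.5 | Sepal.Length > 7.5)

# !is.na() drops rows with a missing value; iris has none, so all rows remain
complete_rows <- iris %>%
  filter(!is.na(Sepal.Length))

nrow(two_species)     # 100
nrow(wide_virginica)  # 23
nrow(complete_rows)   # 150
```

Separating conditions with a comma inside filter() behaves like &, so filter(Species == "virginica", Petal.Width > 2) would be an equivalent way to write the second example.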

2.3.3 mutate()

The verb mutate() creates new columns, and the values of a new column are often computed as functions of the existing variables (i.e. columns).

iris %>% 
  mutate(Length_difference = Sepal.Length - Petal.Length) # not that the new column makes much sense here
iris data: new column added
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Length_difference
5.1 3.5 1.4 0.2 setosa 3.7
4.9 3.0 1.4 0.2 setosa 3.5
4.7 3.2 1.3 0.2 setosa 3.4
4.6 3.1 1.5 0.2 setosa 3.1
5.0 3.6 1.4 0.2 setosa 3.6
5.4 3.9 1.7 0.4 setosa 3.7
4.6 3.4 1.4 0.3 setosa 3.2
5.0 3.4 1.5 0.2 setosa 3.5
4.4 2.9 1.4 0.2 setosa 3.0
4.9 3.1 1.5 0.1 setosa 3.4
5.4 3.7 1.5 0.2 setosa 3.9
4.8 3.4 1.6 0.2 setosa 3.2
4.8 3.0 1.4 0.1 setosa 3.4
4.3 3.0 1.1 0.1 setosa 3.2
5.8 4.0 1.2 0.2 setosa 4.6
5.7 4.4 1.5 0.4 setosa 4.2
5.4 3.9 1.3 0.4 setosa 4.1
5.1 3.5 1.4 0.3 setosa 3.7
5.7 3.8 1.7 0.3 setosa 4.0
5.1 3.8 1.5 0.3 setosa 3.6
5.4 3.4 1.7 0.2 setosa 3.7
5.1 3.7 1.5 0.4 setosa 3.6
4.6 3.6 1.0 0.2 setosa 3.6
5.1 3.3 1.7 0.5 setosa 3.4
4.8 3.4 1.9 0.2 setosa 2.9
5.0 3.0 1.6 0.2 setosa 3.4
5.0 3.4 1.6 0.4 setosa 3.4
5.2 3.5 1.5 0.2 setosa 3.7
5.2 3.4 1.4 0.2 setosa 3.8
4.7 3.2 1.6 0.2 setosa 3.1
4.8 3.1 1.6 0.2 setosa 3.2
5.4 3.4 1.5 0.4 setosa 3.9
5.2 4.1 1.5 0.1 setosa 3.7
5.5 4.2 1.4 0.2 setosa 4.1
4.9 3.1 1.5 0.2 setosa 3.4
5.0 3.2 1.2 0.2 setosa 3.8
5.5 3.5 1.3 0.2 setosa 4.2
4.9 3.6 1.4 0.1 setosa 3.5
4.4 3.0 1.3 0.2 setosa 3.1
5.1 3.4 1.5 0.2 setosa 3.6
5.0 3.5 1.3 0.3 setosa 3.7
4.5 2.3 1.3 0.3 setosa 3.2
4.4 3.2 1.3 0.2 setosa 3.1
5.0 3.5 1.6 0.6 setosa 3.4
5.1 3.8 1.9 0.4 setosa 3.2
4.8 3.0 1.4 0.3 setosa 3.4
5.1 3.8 1.6 0.2 setosa 3.5
4.6 3.2 1.4 0.2 setosa 3.2
5.3 3.7 1.5 0.2 setosa 3.8
5.0 3.3 1.4 0.2 setosa 3.6
7.0 3.2 4.7 1.4 versicolor 2.3
6.4 3.2 4.5 1.5 versicolor 1.9
6.9 3.1 4.9 1.5 versicolor 2.0
5.5 2.3 4.0 1.3 versicolor 1.5
6.5 2.8 4.6 1.5 versicolor 1.9
5.7 2.8 4.5 1.3 versicolor 1.2
6.3 3.3 4.7 1.6 versicolor 1.6
4.9 2.4 3.3 1.0 versicolor 1.6
6.6 2.9 4.6 1.3 versicolor 2.0
5.2 2.7 3.9 1.4 versicolor 1.3
5.0 2.0 3.5 1.0 versicolor 1.5
5.9 3.0 4.2 1.5 versicolor 1.7
6.0 2.2 4.0 1.0 versicolor 2.0
6.1 2.9 4.7 1.4 versicolor 1.4
5.6 2.9 3.6 1.3 versicolor 2.0
6.7 3.1 4.4 1.4 versicolor 2.3
5.6 3.0 4.5 1.5 versicolor 1.1
5.8 2.7 4.1 1.0 versicolor 1.7
6.2 2.2 4.5 1.5 versicolor 1.7
5.6 2.5 3.9 1.1 versicolor 1.7
5.9 3.2 4.8 1.8 versicolor 1.1
6.1 2.8 4.0 1.3 versicolor 2.1
6.3 2.5 4.9 1.5 versicolor 1.4
6.1 2.8 4.7 1.2 versicolor 1.4
6.4 2.9 4.3 1.3 versicolor 2.1
6.6 3.0 4.4 1.4 versicolor 2.2
6.8 2.8 4.8 1.4 versicolor 2.0
6.7 3.0 5.0 1.7 versicolor 1.7
6.0 2.9 4.5 1.5 versicolor 1.5
5.7 2.6 3.5 1.0 versicolor 2.2
5.5 2.4 3.8 1.1 versicolor 1.7
5.5 2.4 3.7 1.0 versicolor 1.8
5.8 2.7 3.9 1.2 versicolor 1.9
6.0 2.7 5.1 1.6 versicolor 0.9
5.4 3.0 4.5 1.5 versicolor 0.9
6.0 3.4 4.5 1.6 versicolor 1.5
6.7 3.1 4.7 1.5 versicolor 2.0
6.3 2.3 4.4 1.3 versicolor 1.9
5.6 3.0 4.1 1.3 versicolor 1.5
5.5 2.5 4.0 1.3 versicolor 1.5
5.5 2.6 4.4 1.2 versicolor 1.1
6.1 3.0 4.6 1.4 versicolor 1.5
5.8 2.6 4.0 1.2 versicolor 1.8
5.0 2.3 3.3 1.0 versicolor 1.7
5.6 2.7 4.2 1.3 versicolor 1.4
5.7 3.0 4.2 1.2 versicolor 1.5
5.7 2.9 4.2 1.3 versicolor 1.5
6.2 2.9 4.3 1.3 versicolor 1.9
5.1 2.5 3.0 1.1 versicolor 2.1
5.7 2.8 4.1 1.3 versicolor 1.6
6.3 3.3 6.0 2.5 virginica 0.3
5.8 2.7 5.1 1.9 virginica 0.7
7.1 3.0 5.9 2.1 virginica 1.2
6.3 2.9 5.6 1.8 virginica 0.7
6.5 3.0 5.8 2.2 virginica 0.7
7.6 3.0 6.6 2.1 virginica 1.0
4.9 2.5 4.5 1.7 virginica 0.4
7.3 2.9 6.3 1.8 virginica 1.0
6.7 2.5 5.8 1.8 virginica 0.9
7.2 3.6 6.1 2.5 virginica 1.1
6.5 3.2 5.1 2.0 virginica 1.4
6.4 2.7 5.3 1.9 virginica 1.1
6.8 3.0 5.5 2.1 virginica 1.3
5.7 2.5 5.0 2.0 virginica 0.7
5.8 2.8 5.1 2.4 virginica 0.7
6.4 3.2 5.3 2.3 virginica 1.1
6.5 3.0 5.5 1.8 virginica 1.0
7.7 3.8 6.7 2.2 virginica 1.0
7.7 2.6 6.9 2.3 virginica 0.8
6.0 2.2 5.0 1.5 virginica 1.0
6.9 3.2 5.7 2.3 virginica 1.2
5.6 2.8 4.9 2.0 virginica 0.7
7.7 2.8 6.7 2.0 virginica 1.0
6.3 2.7 4.9 1.8 virginica 1.4
6.7 3.3 5.7 2.1 virginica 1.0
7.2 3.2 6.0 1.8 virginica 1.2
6.2 2.8 4.8 1.8 virginica 1.4
6.1 3.0 4.9 1.8 virginica 1.2
6.4 2.8 5.6 2.1 virginica 0.8
7.2 3.0 5.8 1.6 virginica 1.4
7.4 2.8 6.1 1.9 virginica 1.3
7.9 3.8 6.4 2.0 virginica 1.5
6.4 2.8 5.6 2.2 virginica 0.8
6.3 2.8 5.1 1.5 virginica 1.2
6.1 2.6 5.6 1.4 virginica 0.5
7.7 3.0 6.1 2.3 virginica 1.6
6.3 3.4 5.6 2.4 virginica 0.7
6.4 3.1 5.5 1.8 virginica 0.9
6.0 3.0 4.8 1.8 virginica 1.2
6.9 3.1 5.4 2.1 virginica 1.5
6.7 3.1 5.6 2.4 virginica 1.1
6.9 3.1 5.1 2.3 virginica 1.8
5.8 2.7 5.1 1.9 virginica 0.7
6.8 3.2 5.9 2.3 virginica 0.9
6.7 3.3 5.7 2.5 virginica 1.0
6.7 3.0 5.2 2.3 virginica 1.5
6.3 2.5 5.0 1.9 virginica 1.3
6.5 3.0 5.2 2.0 virginica 1.3
6.2 3.4 5.4 2.3 virginica 0.8
5.9 3.0 5.1 1.8 virginica 0.8
# To keep only the newly created column, use transmute()
iris %>% 
  transmute(Length_difference = Sepal.Length - Petal.Length)
iris data: new column only
Length_difference
3.7
3.5
3.4
3.1
3.6
3.7
3.2
3.5
3.0
3.4
3.9
3.2
3.4
3.2
4.6
4.2
4.1
3.7
4.0
3.6
3.7
3.6
3.6
3.4
2.9
3.4
3.4
3.7
3.8
3.1
3.2
3.9
3.7
4.1
3.4
3.8
4.2
3.5
3.1
3.6
3.7
3.2
3.1
3.4
3.2
3.4
3.5
3.2
3.8
3.6
2.3
1.9
2.0
1.5
1.9
1.2
1.6
1.6
2.0
1.3
1.5
1.7
2.0
1.4
2.0
2.3
1.1
1.7
1.7
1.7
1.1
2.1
1.4
1.4
2.1
2.2
2.0
1.7
1.5
2.2
1.7
1.8
1.9
0.9
0.9
1.5
2.0
1.9
1.5
1.5
1.1
1.5
1.8
1.7
1.4
1.5
1.5
1.9
2.1
1.6
0.3
0.7
1.2
0.7
0.7
1.0
0.4
1.0
0.9
1.1
1.4
1.1
1.3
0.7
0.7
1.1
1.0
1.0
0.8
1.0
1.2
0.7
1.0
1.4
1.0
1.2
1.4
1.2
0.8
1.4
1.3
1.5
0.8
1.2
0.5
1.6
0.7
0.9
1.2
1.5
1.1
1.8
0.7
0.9
1.0
1.5
1.3
1.3
0.8
0.8


Interestingly, setting an existing column to NULL inside mutate() deletes that column.
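
A minimal sketch of this behaviour, assuming dplyr is installed. The object names no_petal_width and only_difference are made up for illustration. The second pipeline uses the .keep argument of mutate(), which (to my knowledge) is available from dplyr 1.0.0 onwards and gives the same result as transmute() here -

```r
library(dplyr)

# Setting a column to NULL inside mutate() drops it from the result
no_petal_width <- iris %>%
  mutate(Petal.Width = NULL)

names(no_petal_width)

# Keep only the newly created column, similar to transmute()
only_difference <- iris %>%
  mutate(Length_difference = Sepal.Length - Petal.Length, .keep = "none")

names(only_difference)
```

Note that iris itself is not modified in either case; the verbs return a new data-frame.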

2.3.4 rename()

As the name suggests, the rename() verb changes the name of an existing column. The syntax is <new_name> = <old_name>. Example -

iris %>% 
  rename(Species.name=Species) 
iris data: Species column renamed
Sepal.Length Sepal.Width Petal.Length Petal.Width Species.name
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3.0 1.4 0.1 setosa
4.3 3.0 1.1 0.1 setosa
5.8 4.0 1.2 0.2 setosa
5.7 4.4 1.5 0.4 setosa
5.4 3.9 1.3 0.4 setosa
5.1 3.5 1.4 0.3 setosa
5.7 3.8 1.7 0.3 setosa
5.1 3.8 1.5 0.3 setosa
5.4 3.4 1.7 0.2 setosa
5.1 3.7 1.5 0.4 setosa
4.6 3.6 1.0 0.2 setosa
5.1 3.3 1.7 0.5 setosa
4.8 3.4 1.9 0.2 setosa
5.0 3.0 1.6 0.2 setosa
5.0 3.4 1.6 0.4 setosa
5.2 3.5 1.5 0.2 setosa
5.2 3.4 1.4 0.2 setosa
4.7 3.2 1.6 0.2 setosa
4.8 3.1 1.6 0.2 setosa
5.4 3.4 1.5 0.4 setosa
5.2 4.1 1.5 0.1 setosa
5.5 4.2 1.4 0.2 setosa
4.9 3.1 1.5 0.2 setosa
5.0 3.2 1.2 0.2 setosa
5.5 3.5 1.3 0.2 setosa
4.9 3.6 1.4 0.1 setosa
4.4 3.0 1.3 0.2 setosa
5.1 3.4 1.5 0.2 setosa
5.0 3.5 1.3 0.3 setosa
4.5 2.3 1.3 0.3 setosa
4.4 3.2 1.3 0.2 setosa
5.0 3.5 1.6 0.6 setosa
5.1 3.8 1.9 0.4 setosa
4.8 3.0 1.4 0.3 setosa
5.1 3.8 1.6 0.2 setosa
4.6 3.2 1.4 0.2 setosa
5.3 3.7 1.5 0.2 setosa
5.0 3.3 1.4 0.2 setosa
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
6.9 3.1 4.9 1.5 versicolor
5.5 2.3 4.0 1.3 versicolor
6.5 2.8 4.6 1.5 versicolor
5.7 2.8 4.5 1.3 versicolor
6.3 3.3 4.7 1.6 versicolor
4.9 2.4 3.3 1.0 versicolor
6.6 2.9 4.6 1.3 versicolor
5.2 2.7 3.9 1.4 versicolor
5.0 2.0 3.5 1.0 versicolor
5.9 3.0 4.2 1.5 versicolor
6.0 2.2 4.0 1.0 versicolor
6.1 2.9 4.7 1.4 versicolor
5.6 2.9 3.6 1.3 versicolor
6.7 3.1 4.4 1.4 versicolor
5.6 3.0 4.5 1.5 versicolor
5.8 2.7 4.1 1.0 versicolor
6.2 2.2 4.5 1.5 versicolor
5.6 2.5 3.9 1.1 versicolor
5.9 3.2 4.8 1.8 versicolor
6.1 2.8 4.0 1.3 versicolor
6.3 2.5 4.9 1.5 versicolor
6.1 2.8 4.7 1.2 versicolor
6.4 2.9 4.3 1.3 versicolor
6.6 3.0 4.4 1.4 versicolor
6.8 2.8 4.8 1.4 versicolor
6.7 3.0 5.0 1.7 versicolor
6.0 2.9 4.5 1.5 versicolor
5.7 2.6 3.5 1.0 versicolor
5.5 2.4 3.8 1.1 versicolor
5.5 2.4 3.7 1.0 versicolor
5.8 2.7 3.9 1.2 versicolor
6.0 2.7 5.1 1.6 versicolor
5.4 3.0 4.5 1.5 versicolor
6.0 3.4 4.5 1.6 versicolor
6.7 3.1 4.7 1.5 versicolor
6.3 2.3 4.4 1.3 versicolor
5.6 3.0 4.1 1.3 versicolor
5.5 2.5 4.0 1.3 versicolor
5.5 2.6 4.4 1.2 versicolor
6.1 3.0 4.6 1.4 versicolor
5.8 2.6 4.0 1.2 versicolor
5.0 2.3 3.3 1.0 versicolor
5.6 2.7 4.2 1.3 versicolor
5.7 3.0 4.2 1.2 versicolor
5.7 2.9 4.2 1.3 versicolor
6.2 2.9 4.3 1.3 versicolor
5.1 2.5 3.0 1.1 versicolor
5.7 2.8 4.1 1.3 versicolor
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
7.1 3.0 5.9 2.1 virginica
6.3 2.9 5.6 1.8 virginica
6.5 3.0 5.8 2.2 virginica
7.6 3.0 6.6 2.1 virginica
4.9 2.5 4.5 1.7 virginica
7.3 2.9 6.3 1.8 virginica
6.7 2.5 5.8 1.8 virginica
7.2 3.6 6.1 2.5 virginica
6.5 3.2 5.1 2.0 virginica
6.4 2.7 5.3 1.9 virginica
6.8 3.0 5.5 2.1 virginica
5.7 2.5 5.0 2.0 virginica
5.8 2.8 5.1 2.4 virginica
6.4 3.2 5.3 2.3 virginica
6.5 3.0 5.5 1.8 virginica
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
6.0 2.2 5.0 1.5 virginica
6.9 3.2 5.7 2.3 virginica
5.6 2.8 4.9 2.0 virginica
7.7 2.8 6.7 2.0 virginica
6.3 2.7 4.9 1.8 virginica
6.7 3.3 5.7 2.1 virginica
7.2 3.2 6.0 1.8 virginica
6.2 2.8 4.8 1.8 virginica
6.1 3.0 4.9 1.8 virginica
6.4 2.8 5.6 2.1 virginica
7.2 3.0 5.8 1.6 virginica
7.4 2.8 6.1 1.9 virginica
7.9 3.8 6.4 2.0 virginica
6.4 2.8 5.6 2.2 virginica
6.3 2.8 5.1 1.5 virginica
6.1 2.6 5.6 1.4 virginica
7.7 3.0 6.1 2.3 virginica
6.3 3.4 5.6 2.4 virginica
6.4 3.1 5.5 1.8 virginica
6.0 3.0 4.8 1.8 virginica
6.9 3.1 5.4 2.1 virginica
6.7 3.1 5.6 2.4 virginica
6.9 3.1 5.1 2.3 virginica
5.8 2.7 5.1 1.9 virginica
6.8 3.2 5.9 2.3 virginica
6.7 3.3 5.7 2.5 virginica
6.7 3.0 5.2 2.3 virginica
6.3 2.5 5.0 1.9 virginica
6.5 3.0 5.2 2.0 virginica
6.2 3.4 5.4 2.3 virginica
5.9 3.0 5.1 1.8 virginica


Interestingly, you can also rename a column while selecting it with the select() verb -

iris %>% select(Sepal.Length, 
                Sepal.Width, 
                Petal.Length, 
                Petal.Width, 
                Species.name=Species)
iris data: Species column renamed using select()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species.name
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3.0 1.4 0.1 setosa
4.3 3.0 1.1 0.1 setosa
5.8 4.0 1.2 0.2 setosa
5.7 4.4 1.5 0.4 setosa
5.4 3.9 1.3 0.4 setosa
5.1 3.5 1.4 0.3 setosa
5.7 3.8 1.7 0.3 setosa
5.1 3.8 1.5 0.3 setosa
5.4 3.4 1.7 0.2 setosa
5.1 3.7 1.5 0.4 setosa
4.6 3.6 1.0 0.2 setosa
5.1 3.3 1.7 0.5 setosa
4.8 3.4 1.9 0.2 setosa
5.0 3.0 1.6 0.2 setosa
5.0 3.4 1.6 0.4 setosa
5.2 3.5 1.5 0.2 setosa
5.2 3.4 1.4 0.2 setosa
4.7 3.2 1.6 0.2 setosa
4.8 3.1 1.6 0.2 setosa
5.4 3.4 1.5 0.4 setosa
5.2 4.1 1.5 0.1 setosa
5.5 4.2 1.4 0.2 setosa
4.9 3.1 1.5 0.2 setosa
5.0 3.2 1.2 0.2 setosa
5.5 3.5 1.3 0.2 setosa
4.9 3.6 1.4 0.1 setosa
4.4 3.0 1.3 0.2 setosa
5.1 3.4 1.5 0.2 setosa
5.0 3.5 1.3 0.3 setosa
4.5 2.3 1.3 0.3 setosa
4.4 3.2 1.3 0.2 setosa
5.0 3.5 1.6 0.6 setosa
5.1 3.8 1.9 0.4 setosa
4.8 3.0 1.4 0.3 setosa
5.1 3.8 1.6 0.2 setosa
4.6 3.2 1.4 0.2 setosa
5.3 3.7 1.5 0.2 setosa
5.0 3.3 1.4 0.2 setosa
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
6.9 3.1 4.9 1.5 versicolor
5.5 2.3 4.0 1.3 versicolor
6.5 2.8 4.6 1.5 versicolor
5.7 2.8 4.5 1.3 versicolor
6.3 3.3 4.7 1.6 versicolor
4.9 2.4 3.3 1.0 versicolor
6.6 2.9 4.6 1.3 versicolor
5.2 2.7 3.9 1.4 versicolor
5.0 2.0 3.5 1.0 versicolor
5.9 3.0 4.2 1.5 versicolor
6.0 2.2 4.0 1.0 versicolor
6.1 2.9 4.7 1.4 versicolor
5.6 2.9 3.6 1.3 versicolor
6.7 3.1 4.4 1.4 versicolor
5.6 3.0 4.5 1.5 versicolor
5.8 2.7 4.1 1.0 versicolor
6.2 2.2 4.5 1.5 versicolor
5.6 2.5 3.9 1.1 versicolor
5.9 3.2 4.8 1.8 versicolor
6.1 2.8 4.0 1.3 versicolor
6.3 2.5 4.9 1.5 versicolor
6.1 2.8 4.7 1.2 versicolor
6.4 2.9 4.3 1.3 versicolor
6.6 3.0 4.4 1.4 versicolor
6.8 2.8 4.8 1.4 versicolor
6.7 3.0 5.0 1.7 versicolor
6.0 2.9 4.5 1.5 versicolor
5.7 2.6 3.5 1.0 versicolor
5.5 2.4 3.8 1.1 versicolor
5.5 2.4 3.7 1.0 versicolor
5.8 2.7 3.9 1.2 versicolor
6.0 2.7 5.1 1.6 versicolor
5.4 3.0 4.5 1.5 versicolor
6.0 3.4 4.5 1.6 versicolor
6.7 3.1 4.7 1.5 versicolor
6.3 2.3 4.4 1.3 versicolor
5.6 3.0 4.1 1.3 versicolor
5.5 2.5 4.0 1.3 versicolor
5.5 2.6 4.4 1.2 versicolor
6.1 3.0 4.6 1.4 versicolor
5.8 2.6 4.0 1.2 versicolor
5.0 2.3 3.3 1.0 versicolor
5.6 2.7 4.2 1.3 versicolor
5.7 3.0 4.2 1.2 versicolor
5.7 2.9 4.2 1.3 versicolor
6.2 2.9 4.3 1.3 versicolor
5.1 2.5 3.0 1.1 versicolor
5.7 2.8 4.1 1.3 versicolor
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
7.1 3.0 5.9 2.1 virginica
6.3 2.9 5.6 1.8 virginica
6.5 3.0 5.8 2.2 virginica
7.6 3.0 6.6 2.1 virginica
4.9 2.5 4.5 1.7 virginica
7.3 2.9 6.3 1.8 virginica
6.7 2.5 5.8 1.8 virginica
7.2 3.6 6.1 2.5 virginica
6.5 3.2 5.1 2.0 virginica
6.4 2.7 5.3 1.9 virginica
6.8 3.0 5.5 2.1 virginica
5.7 2.5 5.0 2.0 virginica
5.8 2.8 5.1 2.4 virginica
6.4 3.2 5.3 2.3 virginica
6.5 3.0 5.5 1.8 virginica
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
6.0 2.2 5.0 1.5 virginica
6.9 3.2 5.7 2.3 virginica
5.6 2.8 4.9 2.0 virginica
7.7 2.8 6.7 2.0 virginica
6.3 2.7 4.9 1.8 virginica
6.7 3.3 5.7 2.1 virginica
7.2 3.2 6.0 1.8 virginica
6.2 2.8 4.8 1.8 virginica
6.1 3.0 4.9 1.8 virginica
6.4 2.8 5.6 2.1 virginica
7.2 3.0 5.8 1.6 virginica
7.4 2.8 6.1 1.9 virginica
7.9 3.8 6.4 2.0 virginica
6.4 2.8 5.6 2.2 virginica
6.3 2.8 5.1 1.5 virginica
6.1 2.6 5.6 1.4 virginica
7.7 3.0 6.1 2.3 virginica
6.3 3.4 5.6 2.4 virginica
6.4 3.1 5.5 1.8 virginica
6.0 3.0 4.8 1.8 virginica
6.9 3.1 5.4 2.1 virginica
6.7 3.1 5.6 2.4 virginica
6.9 3.1 5.1 2.3 virginica
5.8 2.7 5.1 1.9 virginica
6.8 3.2 5.9 2.3 virginica
6.7 3.3 5.7 2.5 virginica
6.7 3.0 5.2 2.3 virginica
6.3 2.5 5.0 1.9 virginica
6.5 3.0 5.2 2.0 virginica
6.2 3.4 5.4 2.3 virginica
5.9 3.0 5.1 1.8 virginica
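
One difference between the two approaches is worth a quick sketch: rename() keeps all the columns, whereas select() returns only the columns you list. The object names renamed, selected and reordered below are made up for illustration, and everything() is the selection helper that stands for "all the remaining columns" -

```r
library(dplyr)

# rename() keeps every column and only changes the chosen name
renamed <- iris %>%
  rename(Species.name = Species)

# select() keeps only the columns that are listed, renamed or not
selected <- iris %>%
  select(Species.name = Species)

# everything() adds back the remaining columns after the renamed one
reordered <- iris %>%
  select(Species.name = Species, everything())

names(renamed)
names(selected)
names(reordered)
```

So, if you only want a new name and nothing else should change, rename() is usually the safer choice.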

2.3.5 arrange()

The verb arrange() orders the rows of a data-frame by the values of the selected column(s), in ascending order by default, like -

iris %>% 
  arrange(Sepal.Length)
iris data: arranged by Sepal length
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
4.3 3.0 1.1 0.1 setosa
4.4 2.9 1.4 0.2 setosa
4.4 3.0 1.3 0.2 setosa
4.4 3.2 1.3 0.2 setosa
4.5 2.3 1.3 0.3 setosa
4.6 3.1 1.5 0.2 setosa
4.6 3.4 1.4 0.3 setosa
4.6 3.6 1.0 0.2 setosa
4.6 3.2 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.7 3.2 1.6 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3.0 1.4 0.1 setosa
4.8 3.4 1.9 0.2 setosa
4.8 3.1 1.6 0.2 setosa
4.8 3.0 1.4 0.3 setosa
4.9 3.0 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
4.9 3.1 1.5 0.2 setosa
4.9 3.6 1.4 0.1 setosa
4.9 2.4 3.3 1.0 versicolor
4.9 2.5 4.5 1.7 virginica
5.0 3.6 1.4 0.2 setosa
5.0 3.4 1.5 0.2 setosa
5.0 3.0 1.6 0.2 setosa
5.0 3.4 1.6 0.4 setosa
5.0 3.2 1.2 0.2 setosa
5.0 3.5 1.3 0.3 setosa
5.0 3.5 1.6 0.6 setosa
5.0 3.3 1.4 0.2 setosa
5.0 2.0 3.5 1.0 versicolor
5.0 2.3 3.3 1.0 versicolor
5.1 3.5 1.4 0.2 setosa
5.1 3.5 1.4 0.3 setosa
5.1 3.8 1.5 0.3 setosa
5.1 3.7 1.5 0.4 setosa
5.1 3.3 1.7 0.5 setosa
5.1 3.4 1.5 0.2 setosa
5.1 3.8 1.9 0.4 setosa
5.1 3.8 1.6 0.2 setosa
5.1 2.5 3.0 1.1 versicolor
5.2 3.5 1.5 0.2 setosa
5.2 3.4 1.4 0.2 setosa
5.2 4.1 1.5 0.1 setosa
5.2 2.7 3.9 1.4 versicolor
5.3 3.7 1.5 0.2 setosa
5.4 3.9 1.7 0.4 setosa
5.4 3.7 1.5 0.2 setosa
5.4 3.9 1.3 0.4 setosa
5.4 3.4 1.7 0.2 setosa
5.4 3.4 1.5 0.4 setosa
5.4 3.0 4.5 1.5 versicolor
5.5 4.2 1.4 0.2 setosa
5.5 3.5 1.3 0.2 setosa
5.5 2.3 4.0 1.3 versicolor
5.5 2.4 3.8 1.1 versicolor
5.5 2.4 3.7 1.0 versicolor
5.5 2.5 4.0 1.3 versicolor
5.5 2.6 4.4 1.2 versicolor
5.6 2.9 3.6 1.3 versicolor
5.6 3.0 4.5 1.5 versicolor
5.6 2.5 3.9 1.1 versicolor
5.6 3.0 4.1 1.3 versicolor
5.6 2.7 4.2 1.3 versicolor
5.6 2.8 4.9 2.0 virginica
5.7 4.4 1.5 0.4 setosa
5.7 3.8 1.7 0.3 setosa
5.7 2.8 4.5 1.3 versicolor
5.7 2.6 3.5 1.0 versicolor
5.7 3.0 4.2 1.2 versicolor
5.7 2.9 4.2 1.3 versicolor
5.7 2.8 4.1 1.3 versicolor
5.7 2.5 5.0 2.0 virginica
5.8 4.0 1.2 0.2 setosa
5.8 2.7 4.1 1.0 versicolor
5.8 2.7 3.9 1.2 versicolor
5.8 2.6 4.0 1.2 versicolor
5.8 2.7 5.1 1.9 virginica
5.8 2.8 5.1 2.4 virginica
5.8 2.7 5.1 1.9 virginica
5.9 3.0 4.2 1.5 versicolor
5.9 3.2 4.8 1.8 versicolor
5.9 3.0 5.1 1.8 virginica
6.0 2.2 4.0 1.0 versicolor
6.0 2.9 4.5 1.5 versicolor
6.0 2.7 5.1 1.6 versicolor
6.0 3.4 4.5 1.6 versicolor
6.0 2.2 5.0 1.5 virginica
6.0 3.0 4.8 1.8 virginica
6.1 2.9 4.7 1.4 versicolor
6.1 2.8 4.0 1.3 versicolor
6.1 2.8 4.7 1.2 versicolor
6.1 3.0 4.6 1.4 versicolor
6.1 3.0 4.9 1.8 virginica
6.1 2.6 5.6 1.4 virginica
6.2 2.2 4.5 1.5 versicolor
6.2 2.9 4.3 1.3 versicolor
6.2 2.8 4.8 1.8 virginica
6.2 3.4 5.4 2.3 virginica
6.3 3.3 4.7 1.6 versicolor
6.3 2.5 4.9 1.5 versicolor
6.3 2.3 4.4 1.3 versicolor
6.3 3.3 6.0 2.5 virginica
6.3 2.9 5.6 1.8 virginica
6.3 2.7 4.9 1.8 virginica
6.3 2.8 5.1 1.5 virginica
6.3 3.4 5.6 2.4 virginica
6.3 2.5 5.0 1.9 virginica
6.4 3.2 4.5 1.5 versicolor
6.4 2.9 4.3 1.3 versicolor
6.4 2.7 5.3 1.9 virginica
6.4 3.2 5.3 2.3 virginica
6.4 2.8 5.6 2.1 virginica
6.4 2.8 5.6 2.2 virginica
6.4 3.1 5.5 1.8 virginica
6.5 2.8 4.6 1.5 versicolor
6.5 3.0 5.8 2.2 virginica
6.5 3.2 5.1 2.0 virginica
6.5 3.0 5.5 1.8 virginica
6.5 3.0 5.2 2.0 virginica
6.6 2.9 4.6 1.3 versicolor
6.6 3.0 4.4 1.4 versicolor
6.7 3.1 4.4 1.4 versicolor
6.7 3.0 5.0 1.7 versicolor
6.7 3.1 4.7 1.5 versicolor
6.7 2.5 5.8 1.8 virginica
6.7 3.3 5.7 2.1 virginica
6.7 3.1 5.6 2.4 virginica
6.7 3.3 5.7 2.5 virginica
6.7 3.0 5.2 2.3 virginica
6.8 2.8 4.8 1.4 versicolor
6.8 3.0 5.5 2.1 virginica
6.8 3.2 5.9 2.3 virginica
6.9 3.1 4.9 1.5 versicolor
6.9 3.2 5.7 2.3 virginica
6.9 3.1 5.4 2.1 virginica
6.9 3.1 5.1 2.3 virginica
7.0 3.2 4.7 1.4 versicolor
7.1 3.0 5.9 2.1 virginica
7.2 3.6 6.1 2.5 virginica
7.2 3.2 6.0 1.8 virginica
7.2 3.0 5.8 1.6 virginica
7.3 2.9 6.3 1.8 virginica
7.4 2.8 6.1 1.9 virginica
7.6 3.0 6.6 2.1 virginica
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
7.7 2.8 6.7 2.0 virginica
7.7 3.0 6.1 2.3 virginica
7.9 3.8 6.4 2.0 virginica
# Rows are ordered by Sepal.Length first; rows with the same Sepal.Length are then ordered by Sepal.Width, and the rest of each row moves with it.
iris %>% 
  arrange(Sepal.Length,Sepal.Width)
iris data: arranged by Sepal length and width
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
4.3 3.0 1.1 0.1 setosa
4.4 2.9 1.4 0.2 setosa
4.4 3.0 1.3 0.2 setosa
4.4 3.2 1.3 0.2 setosa
4.5 2.3 1.3 0.3 setosa
4.6 3.1 1.5 0.2 setosa
4.6 3.2 1.4 0.2 setosa
4.6 3.4 1.4 0.3 setosa
4.6 3.6 1.0 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.7 3.2 1.6 0.2 setosa
4.8 3.0 1.4 0.1 setosa
4.8 3.0 1.4 0.3 setosa
4.8 3.1 1.6 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3.4 1.9 0.2 setosa
4.9 2.4 3.3 1.0 versicolor
4.9 2.5 4.5 1.7 virginica
4.9 3.0 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
4.9 3.1 1.5 0.2 setosa
4.9 3.6 1.4 0.1 setosa
5.0 2.0 3.5 1.0 versicolor
5.0 2.3 3.3 1.0 versicolor
5.0 3.0 1.6 0.2 setosa
5.0 3.2 1.2 0.2 setosa
5.0 3.3 1.4 0.2 setosa
5.0 3.4 1.5 0.2 setosa
5.0 3.4 1.6 0.4 setosa
5.0 3.5 1.3 0.3 setosa
5.0 3.5 1.6 0.6 setosa
5.0 3.6 1.4 0.2 setosa
5.1 2.5 3.0 1.1 versicolor
5.1 3.3 1.7 0.5 setosa
5.1 3.4 1.5 0.2 setosa
5.1 3.5 1.4 0.2 setosa
5.1 3.5 1.4 0.3 setosa
5.1 3.7 1.5 0.4 setosa
5.1 3.8 1.5 0.3 setosa
5.1 3.8 1.9 0.4 setosa
5.1 3.8 1.6 0.2 setosa
5.2 2.7 3.9 1.4 versicolor
5.2 3.4 1.4 0.2 setosa
5.2 3.5 1.5 0.2 setosa
5.2 4.1 1.5 0.1 setosa
5.3 3.7 1.5 0.2 setosa
5.4 3.0 4.5 1.5 versicolor
5.4 3.4 1.7 0.2 setosa
5.4 3.4 1.5 0.4 setosa
5.4 3.7 1.5 0.2 setosa
5.4 3.9 1.7 0.4 setosa
5.4 3.9 1.3 0.4 setosa
5.5 2.3 4.0 1.3 versicolor
5.5 2.4 3.8 1.1 versicolor
5.5 2.4 3.7 1.0 versicolor
5.5 2.5 4.0 1.3 versicolor
5.5 2.6 4.4 1.2 versicolor
5.5 3.5 1.3 0.2 setosa
5.5 4.2 1.4 0.2 setosa
5.6 2.5 3.9 1.1 versicolor
5.6 2.7 4.2 1.3 versicolor
5.6 2.8 4.9 2.0 virginica
5.6 2.9 3.6 1.3 versicolor
5.6 3.0 4.5 1.5 versicolor
5.6 3.0 4.1 1.3 versicolor
5.7 2.5 5.0 2.0 virginica
5.7 2.6 3.5 1.0 versicolor
5.7 2.8 4.5 1.3 versicolor
5.7 2.8 4.1 1.3 versicolor
5.7 2.9 4.2 1.3 versicolor
5.7 3.0 4.2 1.2 versicolor
5.7 3.8 1.7 0.3 setosa
5.7 4.4 1.5 0.4 setosa
5.8 2.6 4.0 1.2 versicolor
5.8 2.7 4.1 1.0 versicolor
5.8 2.7 3.9 1.2 versicolor
5.8 2.7 5.1 1.9 virginica
5.8 2.7 5.1 1.9 virginica
5.8 2.8 5.1 2.4 virginica
5.8 4.0 1.2 0.2 setosa
5.9 3.0 4.2 1.5 versicolor
5.9 3.0 5.1 1.8 virginica
5.9 3.2 4.8 1.8 versicolor
6.0 2.2 4.0 1.0 versicolor
6.0 2.2 5.0 1.5 virginica
6.0 2.7 5.1 1.6 versicolor
6.0 2.9 4.5 1.5 versicolor
6.0 3.0 4.8 1.8 virginica
6.0 3.4 4.5 1.6 versicolor
6.1 2.6 5.6 1.4 virginica
6.1 2.8 4.0 1.3 versicolor
6.1 2.8 4.7 1.2 versicolor
6.1 2.9 4.7 1.4 versicolor
6.1 3.0 4.6 1.4 versicolor
6.1 3.0 4.9 1.8 virginica
6.2 2.2 4.5 1.5 versicolor
6.2 2.8 4.8 1.8 virginica
6.2 2.9 4.3 1.3 versicolor
6.2 3.4 5.4 2.3 virginica
6.3 2.3 4.4 1.3 versicolor
6.3 2.5 4.9 1.5 versicolor
6.3 2.5 5.0 1.9 virginica
6.3 2.7 4.9 1.8 virginica
6.3 2.8 5.1 1.5 virginica
6.3 2.9 5.6 1.8 virginica
6.3 3.3 4.7 1.6 versicolor
6.3 3.3 6.0 2.5 virginica
6.3 3.4 5.6 2.4 virginica
6.4 2.7 5.3 1.9 virginica
6.4 2.8 5.6 2.1 virginica
6.4 2.8 5.6 2.2 virginica
6.4 2.9 4.3 1.3 versicolor
6.4 3.1 5.5 1.8 virginica
6.4 3.2 4.5 1.5 versicolor
6.4 3.2 5.3 2.3 virginica
6.5 2.8 4.6 1.5 versicolor
6.5 3.0 5.8 2.2 virginica
6.5 3.0 5.5 1.8 virginica
6.5 3.0 5.2 2.0 virginica
6.5 3.2 5.1 2.0 virginica
6.6 2.9 4.6 1.3 versicolor
6.6 3.0 4.4 1.4 versicolor
6.7 2.5 5.8 1.8 virginica
6.7 3.0 5.0 1.7 versicolor
6.7 3.0 5.2 2.3 virginica
6.7 3.1 4.4 1.4 versicolor
6.7 3.1 4.7 1.5 versicolor
6.7 3.1 5.6 2.4 virginica
6.7 3.3 5.7 2.1 virginica
6.7 3.3 5.7 2.5 virginica
6.8 2.8 4.8 1.4 versicolor
6.8 3.0 5.5 2.1 virginica
6.8 3.2 5.9 2.3 virginica
6.9 3.1 4.9 1.5 versicolor
6.9 3.1 5.4 2.1 virginica
6.9 3.1 5.1 2.3 virginica
6.9 3.2 5.7 2.3 virginica
7.0 3.2 4.7 1.4 versicolor
7.1 3.0 5.9 2.1 virginica
7.2 3.0 5.8 1.6 virginica
7.2 3.2 6.0 1.8 virginica
7.2 3.6 6.1 2.5 virginica
7.3 2.9 6.3 1.8 virginica
7.4 2.8 6.1 1.9 virginica
7.6 3.0 6.6 2.1 virginica
7.7 2.6 6.9 2.3 virginica
7.7 2.8 6.7 2.0 virginica
7.7 3.0 6.1 2.3 virginica
7.7 3.8 6.7 2.2 virginica
7.9 3.8 6.4 2.0 virginica
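
Both examples above sort in ascending order. For descending order, a column can be wrapped in the dplyr helper desc(). A small sketch (the object names longest_first and by_species are made up for illustration) -

```r
library(dplyr)

# Largest sepal length first
longest_first <- iris %>%
  arrange(desc(Sepal.Length))

head(longest_first, 3)

# Ascending by species, then descending sepal length within each species
by_species <- iris %>%
  arrange(Species, desc(Sepal.Length))

head(by_species, 3)
```

Since Species is a factor, its rows are ordered by the factor levels (setosa, versicolor, virginica).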

2.3.6 distinct()

The distinct() verb retains only the unique/distinct rows of a data-frame with respect to the selected column(s). It returns only the selected column(s), unless the .keep_all parameter is changed from its default value FALSE to TRUE. Let’s see some examples -

iris %>% distinct(Sepal.Length)
iris data: distinct Sepal length
Sepal.Length
5.1
4.9
4.7
4.6
5.0
5.4
4.4
4.8
4.3
5.8
5.7
5.2
5.5
4.5
5.3
7.0
6.4
6.9
6.5
6.3
6.6
5.9
6.0
6.1
5.6
6.7
6.2
6.8
7.1
7.6
7.3
7.2
7.7
7.4
7.9
# here only the unique combinations of Sepal.Length and Sepal.Width are kept.
iris %>% distinct(Sepal.Length,Sepal.Width) 
iris data: distinct Sepal length and width only
Sepal.Length Sepal.Width
5.1 3.5
4.9 3.0
4.7 3.2
4.6 3.1
5.0 3.6
5.4 3.9
4.6 3.4
5.0 3.4
4.4 2.9
4.9 3.1
5.4 3.7
4.8 3.4
4.8 3.0
4.3 3.0
5.8 4.0
5.7 4.4
5.7 3.8
5.1 3.8
5.4 3.4
5.1 3.7
4.6 3.6
5.1 3.3
5.0 3.0
5.2 3.5
5.2 3.4
4.8 3.1
5.2 4.1
5.5 4.2
5.0 3.2
5.5 3.5
4.9 3.6
4.4 3.0
5.1 3.4
5.0 3.5
4.5 2.3
4.4 3.2
4.6 3.2
5.3 3.7
5.0 3.3
7.0 3.2
6.4 3.2
6.9 3.1
5.5 2.3
6.5 2.8
5.7 2.8
6.3 3.3
4.9 2.4
6.6 2.9
5.2 2.7
5.0 2.0
5.9 3.0
6.0 2.2
6.1 2.9
5.6 2.9
6.7 3.1
5.6 3.0
5.8 2.7
6.2 2.2
5.6 2.5
5.9 3.2
6.1 2.8
6.3 2.5
6.4 2.9
6.6 3.0
6.8 2.8
6.7 3.0
6.0 2.9
5.7 2.6
5.5 2.4
6.0 2.7
5.4 3.0
6.0 3.4
6.3 2.3
5.5 2.5
5.5 2.6
6.1 3.0
5.8 2.6
5.0 2.3
5.6 2.7
5.7 3.0
5.7 2.9
6.2 2.9
5.1 2.5
7.1 3.0
6.3 2.9
6.5 3.0
7.6 3.0
4.9 2.5
7.3 2.9
6.7 2.5
7.2 3.6
6.5 3.2
6.4 2.7
6.8 3.0
5.7 2.5
5.8 2.8
7.7 3.8
7.7 2.6
6.9 3.2
5.6 2.8
7.7 2.8
6.3 2.7
6.7 3.3
7.2 3.2
6.2 2.8
6.4 2.8
7.2 3.0
7.4 2.8
7.9 3.8
6.3 2.8
6.1 2.6
7.7 3.0
6.3 3.4
6.4 3.1
6.0 3.0
6.8 3.2
6.2 3.4
# rest of the columns are also returned.
iris %>% 
  distinct(Sepal.Length,Sepal.Width, .keep_all = TRUE)
iris data: distinct Sepal length and width, all columns kept
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3.0 1.4 0.1 setosa
4.3 3.0 1.1 0.1 setosa
5.8 4.0 1.2 0.2 setosa
5.7 4.4 1.5 0.4 setosa
5.7 3.8 1.7 0.3 setosa
5.1 3.8 1.5 0.3 setosa
5.4 3.4 1.7 0.2 setosa
5.1 3.7 1.5 0.4 setosa
4.6 3.6 1.0 0.2 setosa
5.1 3.3 1.7 0.5 setosa
5.0 3.0 1.6 0.2 setosa
5.2 3.5 1.5 0.2 setosa
5.2 3.4 1.4 0.2 setosa
4.8 3.1 1.6 0.2 setosa
5.2 4.1 1.5 0.1 setosa
5.5 4.2 1.4 0.2 setosa
5.0 3.2 1.2 0.2 setosa
5.5 3.5 1.3 0.2 setosa
4.9 3.6 1.4 0.1 setosa
4.4 3.0 1.3 0.2 setosa
5.1 3.4 1.5 0.2 setosa
5.0 3.5 1.3 0.3 setosa
4.5 2.3 1.3 0.3 setosa
4.4 3.2 1.3 0.2 setosa
4.6 3.2 1.4 0.2 setosa
5.3 3.7 1.5 0.2 setosa
5.0 3.3 1.4 0.2 setosa
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
6.9 3.1 4.9 1.5 versicolor
5.5 2.3 4.0 1.3 versicolor
6.5 2.8 4.6 1.5 versicolor
5.7 2.8 4.5 1.3 versicolor
6.3 3.3 4.7 1.6 versicolor
4.9 2.4 3.3 1.0 versicolor
6.6 2.9 4.6 1.3 versicolor
5.2 2.7 3.9 1.4 versicolor
5.0 2.0 3.5 1.0 versicolor
5.9 3.0 4.2 1.5 versicolor
6.0 2.2 4.0 1.0 versicolor
6.1 2.9 4.7 1.4 versicolor
5.6 2.9 3.6 1.3 versicolor
6.7 3.1 4.4 1.4 versicolor
5.6 3.0 4.5 1.5 versicolor
5.8 2.7 4.1 1.0 versicolor
6.2 2.2 4.5 1.5 versicolor
5.6 2.5 3.9 1.1 versicolor
5.9 3.2 4.8 1.8 versicolor
6.1 2.8 4.0 1.3 versicolor
6.3 2.5 4.9 1.5 versicolor
6.4 2.9 4.3 1.3 versicolor
6.6 3.0 4.4 1.4 versicolor
6.8 2.8 4.8 1.4 versicolor
6.7 3.0 5.0 1.7 versicolor
6.0 2.9 4.5 1.5 versicolor
5.7 2.6 3.5 1.0 versicolor
5.5 2.4 3.8 1.1 versicolor
6.0 2.7 5.1 1.6 versicolor
5.4 3.0 4.5 1.5 versicolor
6.0 3.4 4.5 1.6 versicolor
6.3 2.3 4.4 1.3 versicolor
5.5 2.5 4.0 1.3 versicolor
5.5 2.6 4.4 1.2 versicolor
6.1 3.0 4.6 1.4 versicolor
5.8 2.6 4.0 1.2 versicolor
5.0 2.3 3.3 1.0 versicolor
5.6 2.7 4.2 1.3 versicolor
5.7 3.0 4.2 1.2 versicolor
5.7 2.9 4.2 1.3 versicolor
6.2 2.9 4.3 1.3 versicolor
5.1 2.5 3.0 1.1 versicolor
7.1 3.0 5.9 2.1 virginica
6.3 2.9 5.6 1.8 virginica
6.5 3.0 5.8 2.2 virginica
7.6 3.0 6.6 2.1 virginica
4.9 2.5 4.5 1.7 virginica
7.3 2.9 6.3 1.8 virginica
6.7 2.5 5.8 1.8 virginica
7.2 3.6 6.1 2.5 virginica
6.5 3.2 5.1 2.0 virginica
6.4 2.7 5.3 1.9 virginica
6.8 3.0 5.5 2.1 virginica
5.7 2.5 5.0 2.0 virginica
5.8 2.8 5.1 2.4 virginica
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
6.9 3.2 5.7 2.3 virginica
5.6 2.8 4.9 2.0 virginica
7.7 2.8 6.7 2.0 virginica
6.3 2.7 4.9 1.8 virginica
6.7 3.3 5.7 2.1 virginica
7.2 3.2 6.0 1.8 virginica
6.2 2.8 4.8 1.8 virginica
6.4 2.8 5.6 2.1 virginica
7.2 3.0 5.8 1.6 virginica
7.4 2.8 6.1 1.9 virginica
7.9 3.8 6.4 2.0 virginica
6.3 2.8 5.1 1.5 virginica
6.1 2.6 5.6 1.4 virginica
7.7 3.0 6.1 2.3 virginica
6.3 3.4 5.6 2.4 virginica
6.4 3.1 5.5 1.8 virginica
6.0 3.0 4.8 1.8 virginica
6.8 3.2 5.9 2.3 virginica
6.2 3.4 5.4 2.3 virginica
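
Rather than reading through the long tables, you can simply count the rows that distinct() returns. The sketch below does that for the examples above (the object names are made up for illustration). It also calls distinct() without any column, in which case whole rows are compared -

```r
library(dplyr)

# Number of distinct values / combinations
n_lengths <- iris %>% distinct(Sepal.Length) %>% nrow()
n_pairs   <- iris %>% distinct(Sepal.Length, Sepal.Width) %>% nrow()

# With no columns given, distinct() compares whole rows
n_rows_unique <- iris %>% distinct() %>% nrow()

c(n_lengths, n_pairs, n_rows_unique)
```

When .keep_all = TRUE is used and a combination occurs more than once, the first row found for that combination is the one that is kept.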

2.3.7 slice()

The slice() verb lets you index rows by their (integer) locations. It has some helpers too -

  • slice_head() selects the first row, while slice_tail() selects the last. The same can be done using slice(1) and slice(n()).

  • slice_head(n = <int>) selects the first <int> rows, while slice_tail(n = <int>) selects the last <int> rows.

  • slice_sample() selects rows at random.

  • slice_min() and slice_max() select the rows with the lowest and the highest values of the selected variable, respectively.

A few examples -

iris %>% 
  slice(1)
iris data: the first row
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
iris %>% 
  slice(10:n()) 
iris data: from 10th row to the end
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3.0 1.4 0.1 setosa
4.3 3.0 1.1 0.1 setosa
5.8 4.0 1.2 0.2 setosa
5.7 4.4 1.5 0.4 setosa
5.4 3.9 1.3 0.4 setosa
5.1 3.5 1.4 0.3 setosa
5.7 3.8 1.7 0.3 setosa
5.1 3.8 1.5 0.3 setosa
5.4 3.4 1.7 0.2 setosa
5.1 3.7 1.5 0.4 setosa
4.6 3.6 1.0 0.2 setosa
5.1 3.3 1.7 0.5 setosa
4.8 3.4 1.9 0.2 setosa
5.0 3.0 1.6 0.2 setosa
5.0 3.4 1.6 0.4 setosa
5.2 3.5 1.5 0.2 setosa
5.2 3.4 1.4 0.2 setosa
4.7 3.2 1.6 0.2 setosa
4.8 3.1 1.6 0.2 setosa
5.4 3.4 1.5 0.4 setosa
5.2 4.1 1.5 0.1 setosa
5.5 4.2 1.4 0.2 setosa
4.9 3.1 1.5 0.2 setosa
5.0 3.2 1.2 0.2 setosa
5.5 3.5 1.3 0.2 setosa
4.9 3.6 1.4 0.1 setosa
4.4 3.0 1.3 0.2 setosa
5.1 3.4 1.5 0.2 setosa
5.0 3.5 1.3 0.3 setosa
4.5 2.3 1.3 0.3 setosa
4.4 3.2 1.3 0.2 setosa
5.0 3.5 1.6 0.6 setosa
5.1 3.8 1.9 0.4 setosa
4.8 3.0 1.4 0.3 setosa
5.1 3.8 1.6 0.2 setosa
4.6 3.2 1.4 0.2 setosa
5.3 3.7 1.5 0.2 setosa
5.0 3.3 1.4 0.2 setosa
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
6.9 3.1 4.9 1.5 versicolor
5.5 2.3 4.0 1.3 versicolor
6.5 2.8 4.6 1.5 versicolor
5.7 2.8 4.5 1.3 versicolor
6.3 3.3 4.7 1.6 versicolor
4.9 2.4 3.3 1.0 versicolor
6.6 2.9 4.6 1.3 versicolor
5.2 2.7 3.9 1.4 versicolor
5.0 2.0 3.5 1.0 versicolor
5.9 3.0 4.2 1.5 versicolor
6.0 2.2 4.0 1.0 versicolor
6.1 2.9 4.7 1.4 versicolor
5.6 2.9 3.6 1.3 versicolor
6.7 3.1 4.4 1.4 versicolor
5.6 3.0 4.5 1.5 versicolor
5.8 2.7 4.1 1.0 versicolor
6.2 2.2 4.5 1.5 versicolor
5.6 2.5 3.9 1.1 versicolor
5.9 3.2 4.8 1.8 versicolor
6.1 2.8 4.0 1.3 versicolor
6.3 2.5 4.9 1.5 versicolor
6.1 2.8 4.7 1.2 versicolor
6.4 2.9 4.3 1.3 versicolor
6.6 3.0 4.4 1.4 versicolor
6.8 2.8 4.8 1.4 versicolor
6.7 3.0 5.0 1.7 versicolor
6.0 2.9 4.5 1.5 versicolor
5.7 2.6 3.5 1.0 versicolor
5.5 2.4 3.8 1.1 versicolor
5.5 2.4 3.7 1.0 versicolor
5.8 2.7 3.9 1.2 versicolor
6.0 2.7 5.1 1.6 versicolor
5.4 3.0 4.5 1.5 versicolor
6.0 3.4 4.5 1.6 versicolor
6.7 3.1 4.7 1.5 versicolor
6.3 2.3 4.4 1.3 versicolor
5.6 3.0 4.1 1.3 versicolor
5.5 2.5 4.0 1.3 versicolor
5.5 2.6 4.4 1.2 versicolor
6.1 3.0 4.6 1.4 versicolor
5.8 2.6 4.0 1.2 versicolor
5.0 2.3 3.3 1.0 versicolor
5.6 2.7 4.2 1.3 versicolor
5.7 3.0 4.2 1.2 versicolor
5.7 2.9 4.2 1.3 versicolor
6.2 2.9 4.3 1.3 versicolor
5.1 2.5 3.0 1.1 versicolor
5.7 2.8 4.1 1.3 versicolor
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
7.1 3.0 5.9 2.1 virginica
6.3 2.9 5.6 1.8 virginica
6.5 3.0 5.8 2.2 virginica
7.6 3.0 6.6 2.1 virginica
4.9 2.5 4.5 1.7 virginica
7.3 2.9 6.3 1.8 virginica
6.7 2.5 5.8 1.8 virginica
7.2 3.6 6.1 2.5 virginica
6.5 3.2 5.1 2.0 virginica
6.4 2.7 5.3 1.9 virginica
6.8 3.0 5.5 2.1 virginica
5.7 2.5 5.0 2.0 virginica
5.8 2.8 5.1 2.4 virginica
6.4 3.2 5.3 2.3 virginica
6.5 3.0 5.5 1.8 virginica
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
6.0 2.2 5.0 1.5 virginica
6.9 3.2 5.7 2.3 virginica
5.6 2.8 4.9 2.0 virginica
7.7 2.8 6.7 2.0 virginica
6.3 2.7 4.9 1.8 virginica
6.7 3.3 5.7 2.1 virginica
7.2 3.2 6.0 1.8 virginica
6.2 2.8 4.8 1.8 virginica
6.1 3.0 4.9 1.8 virginica
6.4 2.8 5.6 2.1 virginica
7.2 3.0 5.8 1.6 virginica
7.4 2.8 6.1 1.9 virginica
7.9 3.8 6.4 2.0 virginica
6.4 2.8 5.6 2.2 virginica
6.3 2.8 5.1 1.5 virginica
6.1 2.6 5.6 1.4 virginica
7.7 3.0 6.1 2.3 virginica
6.3 3.4 5.6 2.4 virginica
6.4 3.1 5.5 1.8 virginica
6.0 3.0 4.8 1.8 virginica
6.9 3.1 5.4 2.1 virginica
6.7 3.1 5.6 2.4 virginica
6.9 3.1 5.1 2.3 virginica
5.8 2.7 5.1 1.9 virginica
6.8 3.2 5.9 2.3 virginica
6.7 3.3 5.7 2.5 virginica
6.7 3.0 5.2 2.3 virginica
6.3 2.5 5.0 1.9 virginica
6.5 3.0 5.2 2.0 virginica
6.2 3.4 5.4 2.3 virginica
5.9 3.0 5.1 1.8 virginica
iris %>% 
  slice_min(Sepal.Length)
iris data: row with the lowest sepal length
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
4.3 3 1.1 0.1 setosa
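
The remaining helpers work in the same way. A short sketch follows (the object names are made up for illustration). In current dplyr versions, as far as I know, the number of rows has to be passed by name, i.e. as n = 3 and not just 3 -

```r
library(dplyr)

# First and last three rows
first_three <- iris %>% slice_head(n = 3)
last_three  <- iris %>% slice_tail(n = 3)

# Five rows picked at random; set.seed() makes the pick repeatable
set.seed(1)
random_five <- iris %>% slice_sample(n = 5)

# The row with the highest sepal length
longest <- iris %>% slice_max(Sepal.Length, n = 1)

longest
```

By default slice_min() and slice_max() keep ties, so they may return more rows than you asked for if several rows share the lowest or the highest value.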

2.3.8 join

A disclaimer: there’s no verb (exactly) called join() in dplyr (at least, to date). However, there are two types of join verbs -

  • inner_join() and

  • outer_join (which is also not a verb, but a class of three verbs):

    • left_join(),

    • right_join() and

    • full_join().

Join verbs combine columns from two different data-frames based on a common key column.

The inner_join() verb joins two data-frames and retains only the rows where the keys match. This means that there is a potential loss of observations, which we may not want in a real-life analysis.

On the other hand, if we have two data-frames x and y, the left_join() verb matches the keys from x and y, keeps all the rows from x and joins the matched rows (based on the key column) from y. The empty cells (if any) are filled with NA values. The right_join() verb is the opposite scenario: it keeps all the rows from y. Finally, the full_join() verb retains all the rows from both data-frames, and empty cells are filled with NA values. Let’s clear the concept with some examples -

x <- iris %>% 
  select(Sepal.Length,Sepal.Width,Species) %>% 
  filter(Species %in% c("setosa", "versicolor")) %>% 
  slice_sample(n=10)

y <- iris %>% 
  select(Petal.Length,Petal.Width,Species) %>% 
  filter(Species %in% c("versicolor", "virginica")) %>% 
  slice_sample(n=10)

x %>% 
  inner_join(y, by = "Species")
iris data: inner_join
Sepal.Length Sepal.Width Species Petal.Length Petal.Width
6.8 2.8 versicolor 4.7 1.4
6.8 2.8 versicolor 3.9 1.1
6.8 2.8 versicolor 4.5 1.5
6.8 2.8 versicolor 4.7 1.4
5.7 2.8 versicolor 4.7 1.4
5.7 2.8 versicolor 3.9 1.1
5.7 2.8 versicolor 4.5 1.5
5.7 2.8 versicolor 4.7 1.4
5.8 2.6 versicolor 4.7 1.4
5.8 2.6 versicolor 3.9 1.1
5.8 2.6 versicolor 4.5 1.5
5.8 2.6 versicolor 4.7 1.4
6.0 2.7 versicolor 4.7 1.4
6.0 2.7 versicolor 3.9 1.1
6.0 2.7 versicolor 4.5 1.5
6.0 2.7 versicolor 4.7 1.4
6.2 2.9 versicolor 4.7 1.4
6.2 2.9 versicolor 3.9 1.1
6.2 2.9 versicolor 4.5 1.5
6.2 2.9 versicolor 4.7 1.4
x %>% 
  left_join(y, by = "Species")
iris data: left_join
Sepal.Length Sepal.Width Species Petal.Length Petal.Width
6.8 2.8 versicolor 4.7 1.4
6.8 2.8 versicolor 3.9 1.1
6.8 2.8 versicolor 4.5 1.5
6.8 2.8 versicolor 4.7 1.4
5.0 3.3 setosa NA NA
5.7 2.8 versicolor 4.7 1.4
5.7 2.8 versicolor 3.9 1.1
5.7 2.8 versicolor 4.5 1.5
5.7 2.8 versicolor 4.7 1.4
5.8 2.6 versicolor 4.7 1.4
5.8 2.6 versicolor 3.9 1.1
5.8 2.6 versicolor 4.5 1.5
5.8 2.6 versicolor 4.7 1.4
4.3 3.0 setosa NA NA
6.0 2.7 versicolor 4.7 1.4
6.0 2.7 versicolor 3.9 1.1
6.0 2.7 versicolor 4.5 1.5
6.0 2.7 versicolor 4.7 1.4
6.2 2.9 versicolor 4.7 1.4
6.2 2.9 versicolor 3.9 1.1
6.2 2.9 versicolor 4.5 1.5
6.2 2.9 versicolor 4.7 1.4
4.8 3.1 setosa NA NA
5.8 4.0 setosa NA NA
5.0 3.0 setosa NA NA
x %>% 
  right_join(y, by = "Species")
iris data: right_join
Sepal.Length Sepal.Width Species Petal.Length Petal.Width
6.8 2.8 versicolor 4.7 1.4
6.8 2.8 versicolor 3.9 1.1
6.8 2.8 versicolor 4.5 1.5
6.8 2.8 versicolor 4.7 1.4
5.7 2.8 versicolor 4.7 1.4
5.7 2.8 versicolor 3.9 1.1
5.7 2.8 versicolor 4.5 1.5
5.7 2.8 versicolor 4.7 1.4
5.8 2.6 versicolor 4.7 1.4
5.8 2.6 versicolor 3.9 1.1
5.8 2.6 versicolor 4.5 1.5
5.8 2.6 versicolor 4.7 1.4
6.0 2.7 versicolor 4.7 1.4
6.0 2.7 versicolor 3.9 1.1
6.0 2.7 versicolor 4.5 1.5
6.0 2.7 versicolor 4.7 1.4
6.2 2.9 versicolor 4.7 1.4
6.2 2.9 versicolor 3.9 1.1
6.2 2.9 versicolor 4.5 1.5
6.2 2.9 versicolor 4.7 1.4
NA NA virginica 5.3 1.9
NA NA virginica 5.6 2.4
NA NA virginica 5.7 2.5
NA NA virginica 5.6 2.2
NA NA virginica 5.1 2.0
NA NA virginica 6.6 2.1
x %>% 
  full_join(y, by = "Species")
iris data: full_join
Sepal.Length Sepal.Width Species Petal.Length Petal.Width
6.8 2.8 versicolor 4.7 1.4
6.8 2.8 versicolor 3.9 1.1
6.8 2.8 versicolor 4.5 1.5
6.8 2.8 versicolor 4.7 1.4
5.0 3.3 setosa NA NA
5.7 2.8 versicolor 4.7 1.4
5.7 2.8 versicolor 3.9 1.1
5.7 2.8 versicolor 4.5 1.5
5.7 2.8 versicolor 4.7 1.4
5.8 2.6 versicolor 4.7 1.4
5.8 2.6 versicolor 3.9 1.1
5.8 2.6 versicolor 4.5 1.5
5.8 2.6 versicolor 4.7 1.4
4.3 3.0 setosa NA NA
6.0 2.7 versicolor 4.7 1.4
6.0 2.7 versicolor 3.9 1.1
6.0 2.7 versicolor 4.5 1.5
6.0 2.7 versicolor 4.7 1.4
6.2 2.9 versicolor 4.7 1.4
6.2 2.9 versicolor 3.9 1.1
6.2 2.9 versicolor 4.5 1.5
6.2 2.9 versicolor 4.7 1.4
4.8 3.1 setosa NA NA
5.8 4.0 setosa NA NA
5.0 3.0 setosa NA NA
NA NA virginica 5.3 1.9
NA NA virginica 5.6 2.4
NA NA virginica 5.7 2.5
NA NA virginica 5.6 2.2
NA NA virginica 5.1 2.0
NA NA virginica 6.6 2.1
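Because the iris example above depends on a random sample, it may help to repeat the four joins on data where you can check the result by eye. The following is only a minimal sketch: the data-frames df_x and df_y, and their columns id, x_val and y_val, are made up for illustration.

```r
library(dplyr)

# Two tiny, made-up data-frames (hypothetical, for illustration only)
df_x <- data.frame(id = c(1, 2, 3), x_val = c("a", "b", "c"))
df_y <- data.frame(id = c(2, 3, 4), y_val = c("B", "C", "D"))

inner <- inner_join(df_x, df_y, by = "id")  # ids 2 and 3 only
left  <- left_join(df_x, df_y, by = "id")   # ids 1, 2, 3; y_val is NA for id 1
right <- right_join(df_x, df_y, by = "id")  # ids 2, 3, 4; x_val is NA for id 4
full  <- full_join(df_x, df_y, by = "id")   # ids 1, 2, 3, 4

# Compare the number of rows retained by each join
sapply(list(inner = inner, left = left, right = right, full = full), nrow)
```

Here each id appears only once per data-frame, so no rows are multiplied; this is the difference from the Species example, where the key is repeated on both sides.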

2.3.9 group_by() and summarise()

I will describe the group_by() and summarise() verbs together, because the effect of the former is best seen through the latter. group_by() is the most important grouping verb in dplyr. It takes one or more variables of the data-frame to group by -

iris %>% 
  group_by(Species)
iris data: group_by Species
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3.0 1.4 0.1 setosa
4.3 3.0 1.1 0.1 setosa
5.8 4.0 1.2 0.2 setosa
5.7 4.4 1.5 0.4 setosa
5.4 3.9 1.3 0.4 setosa
5.1 3.5 1.4 0.3 setosa
5.7 3.8 1.7 0.3 setosa
5.1 3.8 1.5 0.3 setosa
5.4 3.4 1.7 0.2 setosa
5.1 3.7 1.5 0.4 setosa
4.6 3.6 1.0 0.2 setosa
5.1 3.3 1.7 0.5 setosa
4.8 3.4 1.9 0.2 setosa
5.0 3.0 1.6 0.2 setosa
5.0 3.4 1.6 0.4 setosa
5.2 3.5 1.5 0.2 setosa
5.2 3.4 1.4 0.2 setosa
4.7 3.2 1.6 0.2 setosa
4.8 3.1 1.6 0.2 setosa
5.4 3.4 1.5 0.4 setosa
5.2 4.1 1.5 0.1 setosa
5.5 4.2 1.4 0.2 setosa
4.9 3.1 1.5 0.2 setosa
5.0 3.2 1.2 0.2 setosa
5.5 3.5 1.3 0.2 setosa
4.9 3.6 1.4 0.1 setosa
4.4 3.0 1.3 0.2 setosa
5.1 3.4 1.5 0.2 setosa
5.0 3.5 1.3 0.3 setosa
4.5 2.3 1.3 0.3 setosa
4.4 3.2 1.3 0.2 setosa
5.0 3.5 1.6 0.6 setosa
5.1 3.8 1.9 0.4 setosa
4.8 3.0 1.4 0.3 setosa
5.1 3.8 1.6 0.2 setosa
4.6 3.2 1.4 0.2 setosa
5.3 3.7 1.5 0.2 setosa
5.0 3.3 1.4 0.2 setosa
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
6.9 3.1 4.9 1.5 versicolor
5.5 2.3 4.0 1.3 versicolor
6.5 2.8 4.6 1.5 versicolor
5.7 2.8 4.5 1.3 versicolor
6.3 3.3 4.7 1.6 versicolor
4.9 2.4 3.3 1.0 versicolor
6.6 2.9 4.6 1.3 versicolor
5.2 2.7 3.9 1.4 versicolor
5.0 2.0 3.5 1.0 versicolor
5.9 3.0 4.2 1.5 versicolor
6.0 2.2 4.0 1.0 versicolor
6.1 2.9 4.7 1.4 versicolor
5.6 2.9 3.6 1.3 versicolor
6.7 3.1 4.4 1.4 versicolor
5.6 3.0 4.5 1.5 versicolor
5.8 2.7 4.1 1.0 versicolor
6.2 2.2 4.5 1.5 versicolor
5.6 2.5 3.9 1.1 versicolor
5.9 3.2 4.8 1.8 versicolor
6.1 2.8 4.0 1.3 versicolor
6.3 2.5 4.9 1.5 versicolor
6.1 2.8 4.7 1.2 versicolor
6.4 2.9 4.3 1.3 versicolor
6.6 3.0 4.4 1.4 versicolor
6.8 2.8 4.8 1.4 versicolor
6.7 3.0 5.0 1.7 versicolor
6.0 2.9 4.5 1.5 versicolor
5.7 2.6 3.5 1.0 versicolor
5.5 2.4 3.8 1.1 versicolor
5.5 2.4 3.7 1.0 versicolor
5.8 2.7 3.9 1.2 versicolor
6.0 2.7 5.1 1.6 versicolor
5.4 3.0 4.5 1.5 versicolor
6.0 3.4 4.5 1.6 versicolor
6.7 3.1 4.7 1.5 versicolor
6.3 2.3 4.4 1.3 versicolor
5.6 3.0 4.1 1.3 versicolor
5.5 2.5 4.0 1.3 versicolor
5.5 2.6 4.4 1.2 versicolor
6.1 3.0 4.6 1.4 versicolor
5.8 2.6 4.0 1.2 versicolor
5.0 2.3 3.3 1.0 versicolor
5.6 2.7 4.2 1.3 versicolor
5.7 3.0 4.2 1.2 versicolor
5.7 2.9 4.2 1.3 versicolor
6.2 2.9 4.3 1.3 versicolor
5.1 2.5 3.0 1.1 versicolor
5.7 2.8 4.1 1.3 versicolor
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
7.1 3.0 5.9 2.1 virginica
6.3 2.9 5.6 1.8 virginica
6.5 3.0 5.8 2.2 virginica
7.6 3.0 6.6 2.1 virginica
4.9 2.5 4.5 1.7 virginica
7.3 2.9 6.3 1.8 virginica
6.7 2.5 5.8 1.8 virginica
7.2 3.6 6.1 2.5 virginica
6.5 3.2 5.1 2.0 virginica
6.4 2.7 5.3 1.9 virginica
6.8 3.0 5.5 2.1 virginica
5.7 2.5 5.0 2.0 virginica
5.8 2.8 5.1 2.4 virginica
6.4 3.2 5.3 2.3 virginica
6.5 3.0 5.5 1.8 virginica
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
6.0 2.2 5.0 1.5 virginica
6.9 3.2 5.7 2.3 virginica
5.6 2.8 4.9 2.0 virginica
7.7 2.8 6.7 2.0 virginica
6.3 2.7 4.9 1.8 virginica
6.7 3.3 5.7 2.1 virginica
7.2 3.2 6.0 1.8 virginica
6.2 2.8 4.8 1.8 virginica
6.1 3.0 4.9 1.8 virginica
6.4 2.8 5.6 2.1 virginica
7.2 3.0 5.8 1.6 virginica
7.4 2.8 6.1 1.9 virginica
7.9 3.8 6.4 2.0 virginica
6.4 2.8 5.6 2.2 virginica
6.3 2.8 5.1 1.5 virginica
6.1 2.6 5.6 1.4 virginica
7.7 3.0 6.1 2.3 virginica
6.3 3.4 5.6 2.4 virginica
6.4 3.1 5.5 1.8 virginica
6.0 3.0 4.8 1.8 virginica
6.9 3.1 5.4 2.1 virginica
6.7 3.1 5.6 2.4 virginica
6.9 3.1 5.1 2.3 virginica
5.8 2.7 5.1 1.9 virginica
6.8 3.2 5.9 2.3 virginica
6.7 3.3 5.7 2.5 virginica
6.7 3.0 5.2 2.3 virginica
6.3 2.5 5.0 1.9 virginica
6.5 3.0 5.2 2.0 virginica
6.2 3.4 5.4 2.3 virginica
5.9 3.0 5.1 1.8 virginica


Apart from some messages on the R console, you don’t see any change in the structure of the iris data-frame yet. Let’s select Sepal.Length and see the effect -

iris %>% 
  group_by(Species) %>% 
  select(Sepal.Length) 
iris data: group by Species and selected by Sepal length
Species Sepal.Length
setosa 5.1
setosa 4.9
setosa 4.7
setosa 4.6
setosa 5.0
setosa 5.4
setosa 4.6
setosa 5.0
setosa 4.4
setosa 4.9
setosa 5.4
setosa 4.8
setosa 4.8
setosa 4.3
setosa 5.8
setosa 5.7
setosa 5.4
setosa 5.1
setosa 5.7
setosa 5.1
setosa 5.4
setosa 5.1
setosa 4.6
setosa 5.1
setosa 4.8
setosa 5.0
setosa 5.0
setosa 5.2
setosa 5.2
setosa 4.7
setosa 4.8
setosa 5.4
setosa 5.2
setosa 5.5
setosa 4.9
setosa 5.0
setosa 5.5
setosa 4.9
setosa 4.4
setosa 5.1
setosa 5.0
setosa 4.5
setosa 4.4
setosa 5.0
setosa 5.1
setosa 4.8
setosa 5.1
setosa 4.6
setosa 5.3
setosa 5.0
versicolor 7.0
versicolor 6.4
versicolor 6.9
versicolor 5.5
versicolor 6.5
versicolor 5.7
versicolor 6.3
versicolor 4.9
versicolor 6.6
versicolor 5.2
versicolor 5.0
versicolor 5.9
versicolor 6.0
versicolor 6.1
versicolor 5.6
versicolor 6.7
versicolor 5.6
versicolor 5.8
versicolor 6.2
versicolor 5.6
versicolor 5.9
versicolor 6.1
versicolor 6.3
versicolor 6.1
versicolor 6.4
versicolor 6.6
versicolor 6.8
versicolor 6.7
versicolor 6.0
versicolor 5.7
versicolor 5.5
versicolor 5.5
versicolor 5.8
versicolor 6.0
versicolor 5.4
versicolor 6.0
versicolor 6.7
versicolor 6.3
versicolor 5.6
versicolor 5.5
versicolor 5.5
versicolor 6.1
versicolor 5.8
versicolor 5.0
versicolor 5.6
versicolor 5.7
versicolor 5.7
versicolor 6.2
versicolor 5.1
versicolor 5.7
virginica 6.3
virginica 5.8
virginica 7.1
virginica 6.3
virginica 6.5
virginica 7.6
virginica 4.9
virginica 7.3
virginica 6.7
virginica 7.2
virginica 6.5
virginica 6.4
virginica 6.8
virginica 5.7
virginica 5.8
virginica 6.4
virginica 6.5
virginica 7.7
virginica 7.7
virginica 6.0
virginica 6.9
virginica 5.6
virginica 7.7
virginica 6.3
virginica 6.7
virginica 7.2
virginica 6.2
virginica 6.1
virginica 6.4
virginica 7.2
virginica 7.4
virginica 7.9
virginica 6.4
virginica 6.3
virginica 6.1
virginica 7.7
virginica 6.3
virginica 6.4
virginica 6.0
virginica 6.9
virginica 6.7
virginica 6.9
virginica 5.8
virginica 6.8
virginica 6.7
virginica 6.7
virginica 6.3
virginica 6.5
virginica 6.2
virginica 5.9


Though I selected only Sepal.Length, the Species column also appears. That’s because we applied the group_by() verb beforehand: dplyr always keeps the grouping variables. But the most dramatic effect can be seen in conjunction with the summarise() verb.
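If you want to confirm that the grouping is really there (it is invisible in the table itself), dplyr has a few helper functions for that. This is a small sketch using the same iris data; the variable name grouped is my own choice.

```r
library(dplyr)

grouped <- iris %>% group_by(Species)

is_grouped_df(grouped)   # TRUE: the data now carry grouping information
group_vars(grouped)      # "Species"
n_groups(grouped)        # 3

# ungroup() removes the grouping, so select() no longer keeps Species
grouped %>%
  ungroup() %>%
  select(Sepal.Length) %>%
  names()                # "Sepal.Length"
```

It is a good habit to ungroup() once you are done with the grouped operations, so that later verbs do not silently work group by group.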

summarise() generates a new data-frame with one row of results for each combination of grouping variables. In the case of no grouping variables, the output has a single row summarising all observations in the input. Now, let’s see the effect of group_by() in conjunction with the summarise() verb -

iris %>% 
  group_by(Species) %>% 
  select(Sepal.Length) %>% 
  summarise(count=n())
iris data: summarised count by Species
Species count
setosa 50
versicolor 50
virginica 50
iris %>% 
  group_by(Species) %>% 
  select(Sepal.Length) %>% 
  summarise(mean_Sepal_length=mean(Sepal.Length))
iris data: Summarised mean Sepal length by Species
Species mean_Sepal_length
setosa 5.006
versicolor 5.936
virginica 6.588
# However, without any grouping -
iris %>% 
  select(Sepal.Length) %>% 
  summarise(mean_Sepal_length=mean(Sepal.Length))
iris data: summarised mean Sepal length without grouping
mean_Sepal_length
5.843333
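You are not limited to one summary per call. As a small sketch, the two grouped summaries above can be combined, with a standard deviation added, in a single summarise(); the column name sd_Sepal_length and the variable sepal_summary are my own choices.

```r
library(dplyr)

sepal_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    count = n(),                             # number of rows in each group
    mean_Sepal_length = mean(Sepal.Length),  # group mean
    sd_Sepal_length = sd(Sepal.Length)       # group standard deviation
  )

sepal_summary
```

The result still has one row per species, and one column per summary.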

2.4 Exercise

Now, it’s time for a mini exercise:

  1. Install the package called gapminder, which contains a dataset that is also called gapminder. For each continent, calculate the mean of life expectancy at birth using the data collected in 2002 or later (2002 inclusive). The answer will look like below -
gapminder data: summarised mean of life expectancy by continent
continent mean_LE
Oceania 80.22975
Europe 77.17460
Americas 73.01508
Asia 69.98118
Africa 54.06563
  2. Do the same for each country (instead of continent) and print the top 10 countries by mean life expectancy at birth. The result will look like this -
gapminder data: summarised mean of life expectancy of top 10 countries
country mean_LE
Japan 82.3015
Hong Kong, China 81.8515
Switzerland 81.1605
Iceland 81.1285
Australia 80.8025
Sweden 80.4620
Italy 80.3930
Spain 80.3605
Israel 80.2205
Canada 80.2115

3 Plotting using ggplot2

3.1 Mini intro to ggplot2

In my opinion, the most elegant package for data visualisation in R is ggplot2. Here, gg stands for the grammar of graphics. Put aside what you have learnt so far about base R plotting techniques; ggplot2 defines the art of plotting in a whole new way. The learning curve may be steep, but once you learn it, you will fall in love with it (I promise). You provide the data, tell ggplot2 which variables to map to the aesthetics, and tell it the plot type you want to draw. ggplot2 will take care of the rest.

3.2 Installation

The easiest way to get ggplot2 is to install the whole tidyverse:

install.packages("tidyverse")

Alternatively, install just ggplot2:

install.packages("ggplot2")

Or the development version from GitHub:

install.packages("devtools")
devtools::install_github("tidyverse/ggplot2")

And then, load it …

library(ggplot2)

3.3 Plotting with ggplot2

3.3.1 Difference between base R plot and ggplot2

In this chapter, I will be using the mtcars dataset for plotting different graphs. To refresh your memory, let’s have a look at the dataset -

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Now, I will draw a scatter plot, first using the base R plot() function, and then using ggplot2.

plot(x=mtcars$mpg, y=mtcars$wt)

ggplot(data = mtcars, mapping = aes(x=mpg,y=wt)) + 
  geom_point()

You can see the stark difference between them.

3.3.2 General parameters for ggplot()

For plotting with ggplot2, you start with the ggplot() function and provide the data. You then add the parameters you need for the plot, such as the aesthetic mapping using mapping = aes(). Then, you add on layers (like geom_point()), scales (like scale_x_continuous()), faceting specifications (like facet_wrap()), and coordinate systems (like coord_flip()).

In short, these are the elements that you might see in a block of plotting code that uses the ggplot() function -

  • data

  • aesthetic mapping

  • geometric objects

  • statistical transformations

  • scales

  • coordinate systems

  • position adjustments

  • faceting

You can specify different layers of the plot and combine them using the “+” operator. Now I will dive into different aspects of the ggplot() function -
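To make the list above concrete, here is a sketch of one plot that touches every element once. The choice of mtcars, of the variables wt, mpg and am, and of the name p_all is mine and only for illustration; you would rarely need all of these in a single plot.

```r
library(ggplot2)

p_all <- ggplot(data = mtcars,                               # data
                mapping = aes(x = wt, y = mpg)) +            # aesthetic mapping
  geom_point(position = "jitter") +                          # geometric object + position adjustment
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # statistical transformation
  scale_x_continuous(name = "Weight (1000 lbs)") +           # scale
  coord_cartesian(ylim = c(10, 35)) +                        # coordinate system
  facet_wrap(~ am)                                           # faceting

p_all
```

Each line after the first is one layer or specification added with “+”, so you can remove or swap any of them without touching the rest.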

3.3.2.1 Aesthetic mapping using aes()

Here, an aesthetic means something that you can see. An aesthetic mapping is the link between a visual attribute and a variable. These are some important aesthetics -

  • position (x,y)

  • colour (basically the colour of the outer rim of the object)

  • fill (the filling-colour/inside-colour of the object)

  • shape (mainly of point)

  • line type

  • size etc

You can read all about them in your RStudio help panel by typing -

vignette("ggplot2-specs")
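As a minimal sketch of the aesthetics listed above, the plot below maps several of them at once in a single geom_point() layer. The variables chosen from mtcars (cyl, am, hp) and the name p_aes are just my picks for illustration.

```r
library(ggplot2)

p_aes <- ggplot(mtcars, aes(x = wt, y = mpg)) +   # position (x, y)
  geom_point(aes(colour = factor(cyl),   # colour varies with the number of cylinders
                 shape  = factor(am),    # shape varies with the transmission type
                 size   = hp))           # size varies with horsepower

p_aes
```

ggplot2 builds a legend for every mapped aesthetic automatically, which is why three legends appear here.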

3.3.2.2 Geometric objects (geom_*)

There are so many geom objects in ggplot2, like -

  • geom_point()

  • geom_line()

  • geom_boxplot()

Again, you can find those geom objects by typing in -

help.search("geom_", package = "ggplot2")

Now it is time to check what I have just mentioned, but before that (as usual) let’s check the data that we are going to use. I will switch to another dataset, called mpg, which comes with the ggplot2 package.

?mpg

3.3.2.2.1 Scatter plot with geom_point()

I will now draw a scatter plot using highway miles per gallon as a function of engine displacement (in litres) -

ggplot(data=mpg, aes(x=displ, y=hwy)) + 
  geom_point()

Interestingly, you can save the whole plot, or only part of it, in a variable -

# the plot can be saved in a variable first and then printed, like -
p1 <- ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point()
# now invoke it
p1

# or 
p <- ggplot(data=mpg, aes(x=displ, y=hwy)) # saved as a base plot object; I will call p and add different layers to it
p2 <- p + geom_point() 
p3 <- p + geom_line()
p4 <- p + geom_smooth()
p5 <- p2 + geom_smooth(se = FALSE, linetype="dashed")
p5
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Now let’s play with colour and size -

p + geom_point(colour="red", alpha = 0.2, size = 3) # set outside aes(): the same value applies to every point

p + geom_point(aes(colour=year, shape=factor(cyl)), size = 3) # mapped inside aes(): the value varies with the data

If you want to play with different shades of colours in your plots, this is a good place to start. The default colour scheme is not colour-blind friendly. You can even find a colour-blind-friendly colour palette following this link.
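One option that needs no extra package is the viridis family of scales that ships with ggplot2; according to the ggplot2 documentation, these palettes are designed to remain readable for viewers with common forms of colour blindness. The snippet below is only a sketch of one such choice (the name p_cb is mine), not the palette from the link above.

```r
library(ggplot2)

p_cb <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class), size = 3) +
  scale_colour_viridis_d()   # discrete viridis palette, one colour per class

p_cb
```

For a continuous variable you would use scale_colour_viridis_c() instead.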

You can play with title and axis labels -

p + 
  geom_point(aes(colour=year), size = 3, alpha = 0.2) +
  #geom_text(aes(label=model)) + # may be not a good idea now.
  labs(
    title = "Fuel efficiency vs Engine displacement",
    subtitle = "Fuel efficiency decreases with the engine size",
    caption = "Two-seater is an exception",
    x = "Engine displacement (L)",
    y = "Highway fuel economy (mpg)",
    colour = "Manufacture year"
  )

If your datapoints are a bit tightly spaced, you can jitter a bit -

p + 
  geom_point(aes(colour=class), size = 3, position = "jitter") # introducing jitter here; to control the amount of jitter, you can use geom_jitter()
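To control how far the points move, you can pass position_jitter() with explicit width and height values (geom_jitter() accepts the same two arguments). The values below, and the seed that makes the jitter reproducible, are arbitrary choices for this sketch.

```r
library(ggplot2)

p_jit <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(
    aes(colour = class), size = 3,
    # move points horizontally by at most 0.1, leave y untouched;
    # the seed gives the same jitter every time the plot is drawn
    position = position_jitter(width = 0.1, height = 0, seed = 123)
  )

p_jit
```

Keeping height = 0 means the highway mileage values are still exact on the y-axis; only the horizontal position is nudged.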

Let’s play with some scaling -

p + 
  geom_point(aes(colour=class), size = 3, alpha = 0.2) +
  scale_x_continuous(name = "x-axis label changed", breaks = seq(0,10,by=5),limits = c(0,10)) +
  scale_y_continuous(trans = "reverse")

p + 
  geom_point(aes(colour=class), size = 3, alpha = 0.2) +
  scale_colour_brewer(palette = "Set1") # the scale_colour_*() family is widely used

You can play with the positioning of the legend, too -

p + 
  geom_point(aes(colour=class), size = 3, alpha = 0.2) +
  theme(legend.position = "left")

p + 
  geom_point(aes(colour=class), size = 3, alpha = 0.2) +
  theme(legend.position = "none") 

3.3.2.3 Coordinate system

I will discuss it with the box plot later in this chapter.

3.3.2.4 Faceting

If you have too many data points, the idea of faceting is to split the plot into sub-plots by an appropriate variable -

p + 
  geom_point(aes(colour=class), size = 3, alpha = 0.2) +
  facet_wrap(~ class, ncol = 2)

p + 
  geom_point(aes(colour=class), size = 3, alpha = 0.2) +
  facet_grid(~ class) # if there were any blank plot, won't be plotted here
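facet_grid() becomes more useful when you facet by two variables, one for the rows and one for the columns. The sketch below uses drv and cyl from the mpg data; the choice of variables and the name p_grid are mine.

```r
library(ggplot2)

p_grid <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.2) +
  facet_grid(drv ~ cyl)   # rows: type of drive train, columns: number of cylinders

p_grid
```

facet_grid() lays out the full rows-by-columns grid, so a combination without any observations, if there is one, shows up as an empty panel.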

3.3.2.5 Theme

There are different themes to play with -

p + 
  geom_point(aes(colour=class), size = 3, alpha = 0.2) +
  theme_void()

3.3.2.6 Other geometric objects

3.3.2.6.1 Bar plot and position adjustment

By default, the bars in a bar plot are stacked. If you fill them by a variable that is not used to plot the bars, you can see what I mean. However, for playing with the bar plot, I will be using another dataset called ‘diamonds’ that comes with the ggplot2 package.

To begin with -

ggplot(data=diamonds) +
  geom_bar(mapping = aes(x=cut))

ggplot(data=diamonds) +
  geom_bar(mapping = aes(x=cut, fill=cut))

But -

ggplot(data=diamonds) +
  geom_bar(mapping = aes(x=cut, fill=clarity))

The position is adjusted by the position argument. Besides the default “stack”, three commonly used options are “identity”, “fill”, and “dodge” -

ggplot(data=diamonds) +
  geom_bar(mapping = aes(x=cut, fill=clarity), position = "identity")

Here, each bar falls exactly where it should be in the context of the plot, so the bars overlap. It looks a little better if you use fill = NA or set an alpha value -

ggplot(data=diamonds) +
  geom_bar(mapping = aes(x=cut, fill=clarity), position = "identity", alpha = 0.2)

ggplot(data=diamonds) +
  geom_bar(mapping = aes(x=cut, colour=clarity), position = "identity", fill=NA) # mind the change of colour and fill

position = “fill” stretches each bar to the full height of the plot, so the segments display the fraction of the values rather than the counts -

ggplot(data=diamonds) +
  geom_bar(mapping = aes(x=cut, fill=clarity), position = "fill")

But what we usually mean by a bar plot is the next one -

ggplot(data=diamonds) +
  geom_bar(mapping = aes(x=cut, fill=clarity), position = "dodge")
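Behind the scenes, geom_bar() counts the rows in each category for you (this is the statistical transformation mentioned earlier). As a sketch that ties this back to dplyr, you can compute the same counts yourself and draw them with geom_col(), which plots the values as given. The names cut_clarity_counts and n_diamonds are my own.

```r
library(dplyr)
library(ggplot2)

# geom_bar() counts the rows for us; here the same counts are computed explicitly
cut_clarity_counts <- diamonds %>%
  count(cut, clarity, name = "n_diamonds")

# geom_col() draws bars from pre-computed values
ggplot(data = cut_clarity_counts) +
  geom_col(mapping = aes(x = cut, y = n_diamonds, fill = clarity), position = "dodge")
```

This route is handy when your data already contain the numbers to plot, for example a table of summarised means.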

3.3.2.6.2 Boxplot

A box plot is very convenient for seeing the distribution of your data and for comparing, side by side, the distributions of a variable across different groups in your data -

ggplot(mpg, aes(class, hwy)) +
  geom_boxplot() +
  coord_flip()

ggplot(mpg, aes(class, hwy)) +
  geom_boxplot() +
  coord_polar()

# Please don't plot boxplot in this way in real-life.

3.4 Plotting exercise

Let’s re-construct this plot. There is an interesting reason behind my backward approach. I gave ChatGPT the dataset and the variables and asked it to write a code snippet, and it did something close to what I wanted. Now, I want you to start from the beginning. Here is some information that you will need -

  • You will need the midwest dataset that comes with the ggplot2 package.

  • Using geom_point(), draw a scatter plot using the variables area and poptotal.

  • Colour the points by state, and set their size by the variable popdensity.

  • Use geom_smooth() to visualise the relationship between the variables area and poptotal using the loess method. Get rid of the confidence interval around the smooth line.

  • Adjust the x- and y-axis accordingly.

  • Annotate the plot accordingly.

4 METABRIC data analysis

Now it’s our turn to apply the techniques that we have learned so far in this workshop. In this section, we will explore some datasets that were part of a study characterising the genomic mutations (SNVs and CNAs) and gene expression profiles of over 2000 primary breast tumours. In addition, detailed clinical information for this study is available alongside the experimental data from cBioPortal. The study was reported in two prominent publications -

Curtis et al., Nature 486:346-52, 2012

Pereira et al., Nature Communications 7:11479, 2016

FYI, the gene expression data were generated using microarrays, genome-wide copy number profiles were obtained using SNP microarrays, and targeted sequencing was performed using a panel of 40 driver-mutation genes to detect mutations (single nucleotide variants).

Let’s download the data and save it in a folder (if you have not done it already). We will be plotting different aspects of the patient-related information in our exploratory data analysis (EDA) workshop today, and for that, we will merge and format the data provided.

Now, let’s load the data one by one using the function read.delim with appropriate parameters (replace the file paths below with the location where you saved the data) -

library(dplyr)
library(ggplot2)


# Load patient data and explore a few of the columns (e.g. BREAST_SURGERY, CELLULARITY,CHEMOTHERAPY, ER_IHC ) -
patient_data <- read.delim("/Users/mahedi/Documents/Collaborations/UCL_CI/metabric/brca_metabric/data_clinical_patient.txt",comment.char = "#", sep = "\t")

patient_data %>% pull(BREAST_SURGERY) %>% table
## .
##                   BREAST CONSERVING        MASTECTOMY 
##               554               785              1170
patient_data %>% pull(CELLULARITY) %>% table
## .
##              High      Low Moderate 
##      592      965      215      737
patient_data %>% pull(CHEMOTHERAPY) %>% table
## .
##        NO  YES 
##  529 1568  412
patient_data %>% pull(ER_IHC) %>% table
## .
##          Negative  Positve 
##       83      609     1817
# Load sample data and explore the ER_STATUS
sample_data <- read.delim("/Users/mahedi/Documents/Collaborations/UCL_CI/metabric/brca_metabric/data_clinical_sample.txt",comment.char = "#", sep = "\t")

sample_data %>% pull(ER_STATUS) %>% table
## .
## Negative Positive 
##      644     1825
# Load CNA data and explore
CNA_data <- read.table("/Users/mahedi/Documents/Collaborations/UCL_CI/metabric/brca_metabric/data_cna.txt",header = TRUE, sep = "\t") %>%
  select(-Entrez_Gene_Id) %>%
  distinct(Hugo_Symbol, .keep_all = TRUE)

CNA_data[1:10, 1:10]
##    Hugo_Symbol MB.0000 MB.0039 MB.0045 MB.0046 MB.0048 MB.0050 MB.0053 MB.0062
## 1         A1BG       0       0      -1       0       0       0       0      -1
## 2     A1BG-AS1       0       0      -1       0       0       0       0      -1
## 3         A1CF       0       0       0       0       1       0       0       0
## 4          A2M       0       0      -1      -1       0       0       0       2
## 5      A2M-AS1       0       0      -1      -1       0       0       0       2
## 6        A2ML1       0       0      -1      -1       0       0       0       2
## 7        A2MP1       0       0      -1      -1       0       0       0       2
## 8      A3GALT2       0       0       0       0       0       0       0      -1
## 9       A4GALT       0       0       0      -1      -1      -1       0       1
## 10       A4GNT       0       0       2       0       0       0       1       1
##    MB.0064
## 1        0
## 2        0
## 3        0
## 4        0
## 5        0
## 6        0
## 7        0
## 8        0
## 9        0
## 10       0
# Load mutation data and explore
mutation_data <- read.delim("/Users/mahedi/Documents/Collaborations/UCL_CI/metabric/brca_metabric/data_mutations.txt",comment.char = "#", sep = "\t") 

mutation_data %>% head()
##   Hugo_Symbol Entrez_Gene_Id   Center NCBI_Build Chromosome Start_Position
## 1        TP53             NA METABRIC     GRCh37         17        7579344
## 2        TP53             NA METABRIC     GRCh37         17        7579346
## 3       MLLT4             NA METABRIC     GRCh37          6      168299111
## 4         NF2             NA METABRIC     GRCh37         22       29999995
## 5       SF3B1             NA METABRIC     GRCh37          2      198288682
## 6        NT5E             NA METABRIC     GRCh37          6       86195125
##   End_Position Strand              Consequence Variant_Classification
## 1      7579345      +       frameshift_variant        Frame_Shift_Ins
## 2      7579347      + protein_altering_variant           In_Frame_Ins
## 3    168299111      +         missense_variant      Missense_Mutation
## 4     29999995      +         missense_variant      Missense_Mutation
## 5    198288682      +       synonymous_variant                 Silent
## 6     86195125      +       synonymous_variant                 Silent
##   Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS
## 1          INS                -                 -                 G       NA
## 2          INS                -                 -               CAG       NA
## 3          SNP                G                 G                 T       NA
## 4          SNP                G                 G                 T       NA
## 5          SNP                A                 A                 T       NA
## 6          SNP                T                 T                 C       NA
##   dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode
## 1               NA            MTS-T0058                          NA
## 2               NA            MTS-T0058                          NA
## 3               NA            MTS-T0058                          NA
## 4               NA            MTS-T0058                          NA
## 5               NA            MTS-T0059                          NA
## 6               NA            MTS-T0059                          NA
##   Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1
## 1                     NA                     NA                       NA
## 2                     NA                     NA                       NA
## 3                     NA                     NA                       NA
## 4                     NA                     NA                       NA
## 5                     NA                     NA                       NA
## 6                     NA                     NA                       NA
##   Tumor_Validation_Allele2 Match_Norm_Validation_Allele1
## 1                       NA                            NA
## 2                       NA                            NA
## 3                       NA                            NA
## 4                       NA                            NA
## 5                       NA                            NA
## 6                       NA                            NA
##   Match_Norm_Validation_Allele2 Verification_Status Validation_Status
## 1                            NA                  NA                NA
## 2                            NA                  NA                NA
## 3                            NA                  NA                NA
## 4                            NA                  NA                NA
## 5                            NA                  NA                NA
## 6                            NA                  NA                NA
##   Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score
## 1              NA               NA              NA                NA    NA
## 2              NA               NA              NA                NA    NA
## 3              NA               NA              NA                NA    NA
## 4              NA               NA              NA                NA    NA
## 5              NA               NA              NA                NA    NA
## 6              NA               NA              NA                NA    NA
##   BAM_File            Sequencer t_ref_count t_alt_count n_ref_count n_alt_count
## 1       NA Illumina HiSeq 2,000          NA          NA          NA          NA
## 2       NA Illumina HiSeq 2,000          NA          NA          NA          NA
## 3       NA Illumina HiSeq 2,000          NA          NA          NA          NA
## 4       NA Illumina HiSeq 2,000          NA          NA          NA          NA
## 5       NA Illumina HiSeq 2,000          NA          NA          NA          NA
## 6       NA Illumina HiSeq 2,000          NA          NA          NA          NA
##                               HGVSc                HGVSp    HGVSp_Short
## 1        ENST00000269305.4:c.343dup   p.His115ProfsTer34   p.H115Pfs*34
## 2 ENST00000269305.4:c.340_341insCTG p.Leu114delinsSerVal p.L114delinsSV
## 3       ENST00000392108.3:c.1544G>T          p.Gly515Val        p.G515V
## 4          ENST00000338641.4:c.8G>T            p.Gly3Val          p.G3V
## 5         ENST00000335508.6:c.45T>A             p.Ile15=         p.I15=
## 6        ENST00000257770.3:c.924T>C            p.Ile308=        p.I308=
##     Transcript_ID         RefSeq Protein_position     Codons Hotspot
## 1 ENST00000269305 NM_001126112.2              114        -/C       0
## 2 ENST00000269305 NM_001126112.2              114 ttg/tCTGtg       0
## 3 ENST00000392108 NM_001040000.2              515    gGa/gTa       0
## 4 ENST00000338641    NM_000268.3                3    gGg/gTg       0
## 5 ENST00000335508    NM_012433.2               15    atT/atA       0
## 6 ENST00000257770    NM_002526.3              308    atT/atC       0
# Load expression data and explore
expression_data <- read.delim("/Users/mahedi/Documents/Collaborations/UCL_CI/metabric/brca_metabric/data_mrna_agilent_microarray.txt",comment.char = "#", sep = "\t", header = TRUE)

expression_data[1:10, 1:10]
##    Hugo_Symbol Entrez_Gene_Id  MB.0362  MB.0346   MB.0386  MB.0574  MB.0185
## 1         RERE            473 8.676978 9.653589  9.033589 8.814855 8.736406
## 2       RNF165         494470 6.075331 6.687887  5.910885 5.628740 6.392422
## 3         PHF7          51533 5.838270 5.600876  6.030718 5.849428 5.542133
## 4        CIDEA           1149 6.397503 5.246319 10.111816 6.116868 5.184098
## 5        TENT2         167153 7.906217 8.267256  7.959291 9.206376 8.162845
## 6      SLC17A3          10786 5.702379 5.521794  5.689533 5.439130 5.464326
## 7          SDS          10993 6.930741 6.141689  6.529312 6.430102 6.105427
## 8     ATP6V1C2         245973 5.332863 7.563477  5.482155 5.398675 5.026018
## 9           F3           2152 5.275676 5.376381  5.463788 5.409761 5.338580
## 10      FAM71C         196472 5.443896 5.319857  5.254294 5.512298 5.430874
##     MB.0503  MB.0641  MB.0201
## 1  9.274265 9.286585 8.437347
## 2  5.908698 6.206729 6.095592
## 3  5.964661 5.783374 5.737572
## 4  7.828171 8.744149 5.480091
## 5  8.706646 8.518929 7.478413
## 6  5.417484 5.629885 5.686286
## 7  6.684893 5.632753 5.866132
## 8  5.266674 5.701353 6.403136
## 9  5.490693 5.363266 6.341856
## 10 5.363378 5.191612 5.208379
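A quick note on the read.delim() parameters used above. In the tables of the clinical columns, the first count has no label, which presumably means those cells are empty strings rather than NA. The sketch below shows, on a tiny made-up file, how comment.char skips the lines starting with “#” and how na.strings can turn empty fields into proper missing values. The file contents, the PATIENT_ID column and the name toy_data are assumptions for illustration only; na.strings was not used when loading the real data above.

```r
# A tiny, made-up file in a similar layout (hypothetical contents):
# metadata lines start with "#", columns are tab-separated,
# and a missing value is an empty field
toy_lines <- c(
  "#Patient Identifier\tChemotherapy",
  "#STRING\tSTRING",
  "PATIENT_ID\tCHEMOTHERAPY",
  "P1\tYES",
  "P2\t",
  "P3\tNO"
)

toy_data <- read.delim(
  text = toy_lines,          # read from the character vector instead of a file
  comment.char = "#",        # skip the metadata lines
  sep = "\t",
  na.strings = c("", "NA")   # treat empty fields as missing
)

# Missing values are now counted under NA instead of an unlabelled category
table(toy_data$CHEMOTHERAPY, useNA = "ifany")
```

Whether you want empty cells as NA depends on the analysis; if you do, the same na.strings argument can be added to the read.delim() calls for the real files.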

To begin with, let’s explore the mutation data a bit by plotting the frequency of different types of mutations -

head(mutation_data)
##   Hugo_Symbol Entrez_Gene_Id   Center NCBI_Build Chromosome Start_Position
## 1        TP53             NA METABRIC     GRCh37         17        7579344
## 2        TP53             NA METABRIC     GRCh37         17        7579346
## 3       MLLT4             NA METABRIC     GRCh37          6      168299111
## 4         NF2             NA METABRIC     GRCh37         22       29999995
## 5       SF3B1             NA METABRIC     GRCh37          2      198288682
## 6        NT5E             NA METABRIC     GRCh37          6       86195125
##   End_Position Strand              Consequence Variant_Classification
## 1      7579345      +       frameshift_variant        Frame_Shift_Ins
## 2      7579347      + protein_altering_variant           In_Frame_Ins
## 3    168299111      +         missense_variant      Missense_Mutation
## 4     29999995      +         missense_variant      Missense_Mutation
## 5    198288682      +       synonymous_variant                 Silent
## 6     86195125      +       synonymous_variant                 Silent
##   Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS
## 1          INS                -                 -                 G       NA
## 2          INS                -                 -               CAG       NA
## 3          SNP                G                 G                 T       NA
## 4          SNP                G                 G                 T       NA
## 5          SNP                A                 A                 T       NA
## 6          SNP                T                 T                 C       NA
##   dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode
## 1               NA            MTS-T0058                          NA
## 2               NA            MTS-T0058                          NA
## 3               NA            MTS-T0058                          NA
## 4               NA            MTS-T0058                          NA
## 5               NA            MTS-T0059                          NA
## 6               NA            MTS-T0059                          NA
##   Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1
## 1                     NA                     NA                       NA
## 2                     NA                     NA                       NA
## 3                     NA                     NA                       NA
## 4                     NA                     NA                       NA
## 5                     NA                     NA                       NA
## 6                     NA                     NA                       NA
##   Tumor_Validation_Allele2 Match_Norm_Validation_Allele1
## 1                       NA                            NA
## 2                       NA                            NA
## 3                       NA                            NA
## 4                       NA                            NA
## 5                       NA                            NA
## 6                       NA                            NA
##   Match_Norm_Validation_Allele2 Verification_Status Validation_Status
## 1                            NA                  NA                NA
## 2                            NA                  NA                NA
## 3                            NA                  NA                NA
## 4                            NA                  NA                NA
## 5                            NA                  NA                NA
## 6                            NA                  NA                NA
##   Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score
## 1              NA               NA              NA                NA    NA
## 2              NA               NA              NA                NA    NA
## 3              NA               NA              NA                NA    NA
## 4              NA               NA              NA                NA    NA
## 5              NA               NA              NA                NA    NA
## 6              NA               NA              NA                NA    NA
##   BAM_File            Sequencer t_ref_count t_alt_count n_ref_count n_alt_count
## 1       NA Illumina HiSeq 2,000          NA          NA          NA          NA
## 2       NA Illumina HiSeq 2,000          NA          NA          NA          NA
## 3       NA Illumina HiSeq 2,000          NA          NA          NA          NA
## 4       NA Illumina HiSeq 2,000          NA          NA          NA          NA
## 5       NA Illumina HiSeq 2,000          NA          NA          NA          NA
## 6       NA Illumina HiSeq 2,000          NA          NA          NA          NA
##                               HGVSc                HGVSp    HGVSp_Short
## 1        ENST00000269305.4:c.343dup   p.His115ProfsTer34   p.H115Pfs*34
## 2 ENST00000269305.4:c.340_341insCTG p.Leu114delinsSerVal p.L114delinsSV
## 3       ENST00000392108.3:c.1544G>T          p.Gly515Val        p.G515V
## 4          ENST00000338641.4:c.8G>T            p.Gly3Val          p.G3V
## 5         ENST00000335508.6:c.45T>A             p.Ile15=         p.I15=
## 6        ENST00000257770.3:c.924T>C            p.Ile308=        p.I308=
##     Transcript_ID         RefSeq Protein_position     Codons Hotspot
## 1 ENST00000269305 NM_001126112.2              114        -/C       0
## 2 ENST00000269305 NM_001126112.2              114 ttg/tCTGtg       0
## 3 ENST00000392108 NM_001040000.2              515    gGa/gTa       0
## 4 ENST00000338641    NM_000268.3                3    gGg/gTg       0
## 5 ENST00000335508    NM_012433.2               15    atT/atA       0
## 6 ENST00000257770    NM_002526.3              308    atT/atC       0
ggplot(data=mutation_data,mapping = aes(Variant_Classification, fill=Variant_Classification)) + 
  geom_bar() + 
  coord_flip()
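
The bars in this plot follow the alphabetical order of Variant_Classification. As a minimal sketch, assuming a small made-up data frame (`toy_mutations` is hypothetical and not part of the METABRIC files), the classes can be reordered by their counts before plotting, so that the most frequent class ends up at the top after `coord_flip()`:

```r
library(ggplot2)

# Toy stand-in for mutation_data (hypothetical values, for illustration only)
toy_mutations <- data.frame(
  Variant_Classification = c("Missense_Mutation", "Silent", "Missense_Mutation",
                             "Frame_Shift_Ins", "Missense_Mutation", "Silent")
)

# Order the factor levels by how often each class occurs (least to most frequent)
class_counts <- sort(table(toy_mutations$Variant_Classification))
toy_mutations$Variant_Classification <- factor(toy_mutations$Variant_Classification,
                                               levels = names(class_counts))

# Same layers as the plot on the real data; only the level order differs
p <- ggplot(toy_mutations, aes(Variant_Classification, fill = Variant_Classification)) +
  geom_bar() +
  coord_flip()
p
```

ggplot2 draws discrete axes in factor-level order, so changing the levels is enough; the layers themselves stay untouched.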

Now we will build a word cloud of the genes that have been affected by mutations -

# install.packages("wordcloud")
library(wordcloud)
## Loading required package: RColorBrewer
# We need the gene name and how many times they are affected by any non-synonymous mutation -
mutation_wordcloud_data <- mutation_data %>%
  filter(Consequence != "synonymous_variant") %>%
  group_by(Hugo_Symbol) %>% 
  summarise(freq=n()) %>% 
  rename(word=Hugo_Symbol)

mutation_wordcloud_data %>% head
## # A tibble: 6 × 2
##   word    freq
##   <chr>  <int>
## 1 ACVRL1    13
## 2 AFF2      44
## 3 AGMO      32
## 4 AGTR2     14
## 5 AHNAK    246
## 6 AHNAK2   537
# Let's find out some highly affected genes - 
ggplot(mutation_wordcloud_data %>% filter(freq > 100)) +
  geom_col(aes(word, freq)) +
  coord_flip()

# Now create the word cloud
wordcloud(words = mutation_wordcloud_data %>% pull(word),
          freq = mutation_wordcloud_data %>% pull(freq),
          scale = c(5, 0.5),      # Font size range: largest, then smallest
          max.words = 100,        # Plot at most the 100 most frequent genes
          random.order = FALSE,   # Plot words in decreasing order of frequency
          rot.per = 0.35,         # Proportion of words rotated by 90 degrees
          use.r.layout = FALSE,   # FALSE uses the C++ collision detection
          colors = brewer.pal(8, "Dark2"))

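Now, we will subset the loaded data so that we can merge (or join) the pieces together later. We will create new datasets containing the number of non-synonymous mutations per patient, the expression of a selected set of genes, and selected columns of the sample and patient data.

We will then combine all of these by patient_ID to create a master dataset that we will use in the rest of the workshop.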

# Find out the frequency of mutations per patient
mutation_per_patient <- mutation_data %>%
  filter(Consequence != "synonymous_variant") %>%
  pull(Tumor_Sample_Barcode) %>%
  table() %>%
  data.frame() %>% 
  select(patient_ID = ".", Mutation_count=Freq)



# subsetting and formatting the expression data 
sub_expression_data <- expression_data %>% 
  filter(Hugo_Symbol %in% c("GATA3","FOXA1","MLPH","ESR1","ERBB2","PGR","TP53","PIK3CA",
                            "AKT1", "PTEN", "PIK3R1", "FOXO3","RB1", "KMT2C", "ARID1A",
                            "NCOR1","CTCF","MAP3K1","NF1","CDH1","TBX3","CBFB","RUNX1",
                            "USP9X","SF3B1"))

rm(expression_data)

rownames(sub_expression_data) <- sub_expression_data$Hugo_Symbol

sub_expression_data <- sub_expression_data %>%
  select(-Hugo_Symbol,-Entrez_Gene_Id) %>%
  t() %>%
  data.frame() %>%
  mutate(patient_ID = rownames(.))


# subsetting the sample_data

sub_sample_data <- sample_data %>% 
  select(patient_ID = PATIENT_ID,
         sample_ID = SAMPLE_ID,
         cancer_type = CANCER_TYPE,
         cancer_type_detailed = CANCER_TYPE_DETAILED,
         ER_status = ER_STATUS,
         HER2_status = HER2_STATUS,
         PR_status = PR_STATUS,
         Neoplasm_Histologic_Grade = GRADE)

rm(sample_data)

# subsetting the patient data 
sub_patient_data <- patient_data %>%
   select(patient_ID = PATIENT_ID,
          Three_gene_classifier_subtype = THREEGENE,
          Age_at_diagnosis = AGE_AT_DIAGNOSIS,
          Cellularity = CELLULARITY,
          Chemotherapy = CHEMOTHERAPY,
          ER_status_measured_by_IHC = ER_IHC,
          Hormone_therapy = HORMONE_THERAPY,
          Integrative_cluster = INTCLUST,
          Nottingham_prognostic_index = NPI,
          PAM50 = CLAUDIN_SUBTYPE)
 


# let's combine the dataset 
combined_data <- left_join(sub_patient_data,sub_sample_data, by="patient_ID")
combined_data <- left_join(combined_data, mutation_per_patient, by="patient_ID")
 
# replace '-' with '.' in the patient_ID column so that the IDs match those in the expression data
combined_data$patient_ID <- gsub("-", ".", combined_data$patient_ID, fixed = TRUE)

combined_data <- left_join(combined_data,sub_expression_data, by="patient_ID")
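
The ID clean-up matters because `left_join()` only matches identifiers that are exactly equal. The sketch below uses two invented tables (`toy_patients` and `toy_expression` are hypothetical) whose IDs mimic the two formats seen here, and shows how `anti_join()` can be used to check which patients still lack a match after harmonising:

```r
library(dplyr)

# Hypothetical toy tables; the IDs mimic the '-' and '.' formats used in this workshop
toy_patients   <- data.frame(patient_ID = c("MB-0001", "MB-0002", "MB-0003"),
                             age = c(50, 61, 47))
toy_expression <- data.frame(patient_ID = c("MB.0001", "MB.0002"),
                             GATA3 = c(9.1, 7.4))

# Without harmonising the IDs nothing matches, so GATA3 is NA for every patient
unmatched <- left_join(toy_patients, toy_expression, by = "patient_ID")

# Replace '-' with '.' (fixed = TRUE treats the pattern as plain text)
toy_patients$patient_ID <- gsub("-", ".", toy_patients$patient_ID, fixed = TRUE)
matched <- left_join(toy_patients, toy_expression, by = "patient_ID")

# anti_join() lists the patients that still have no expression data
missing_expression <- anti_join(toy_patients, toy_expression, by = "patient_ID")
missing_expression
```

A left join never drops rows from the left table, so unmatched patients stay in the result with NA in the joined columns. That is one plausible source of the "Removed ... rows containing missing values" warnings in the plots.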

Now, we will generate a scatter plot of the expression of the estrogen receptor gene ESR1 against that of the transcription factor GATA3. We will then examine their co-expression by fitting a linear model (on the plot, of course), and refine it by ER_status (positive or negative) -

ggplot(data = combined_data) +
  geom_point(mapping = aes(x = GATA3, y = ESR1))
## Warning: Removed 529 rows containing missing values (`geom_point()`).

ggplot(data = combined_data %>% na.omit(),  aes(x = GATA3, y = ESR1)) +
  geom_point() + 
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(data = combined_data %>% na.omit()) +
  geom_point(mapping = aes(x = GATA3, y = ESR1, colour = ER_status))

ggplot(data = combined_data %>% na.omit(),  aes(x = GATA3, y = ESR1, colour = ER_status)) +
  geom_point() + 
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
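
`geom_smooth(method = "lm")` draws the least-squares line of y on x. As a rough sketch on simulated values (the numbers below are made up and only mimic the general shape of expression data), the same kind of line can be fitted explicitly with `lm()` to read off its intercept and slope, both overall and within each ER group:

```r
set.seed(1)

# Simulated stand-in for combined_data (hypothetical values)
toy <- data.frame(
  ER_status = rep(c("Positive", "Negative"), each = 50),
  GATA3     = c(rnorm(50, mean = 10, sd = 1), rnorm(50, mean = 7, sd = 1))
)
toy$ESR1 <- 2 + 0.8 * toy$GATA3 + rnorm(100, sd = 0.5)

# One line for all tumours, as in the plot without colour
fit_all <- lm(ESR1 ~ GATA3, data = toy)
coef(fit_all)

# One line per ER group, as in the plot coloured by ER_status
fit_by_group <- lapply(split(toy, toy$ER_status),
                       function(d) coef(lm(ESR1 ~ GATA3, data = d)))
fit_by_group
```

When `colour = ER_status` sits in the top-level `aes()`, ggplot2 groups the data by that variable, which is why the last plot shows one fitted line per ER group.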

On a different note, GATA3 expression is usually high in the Luminal A subtype of breast cancer and in tumours with positive estrogen receptor (ER+) status (Voduc D et al.). Let’s find out if that’s true for this study -

# GATA3 expression in PAM50 classified tumour types-
ggplot(combined_data, aes(PAM50, GATA3)) + 
  geom_boxplot()
## Warning: Removed 529 rows containing non-finite values (`stat_boxplot()`).

# GATA3 expression in tumour with different ER status (positive and negative)-
ggplot(combined_data %>% na.omit(), aes(ER_status, GATA3)) + 
  geom_boxplot()

ggplot(combined_data %>% na.omit(), aes(ER_status, GATA3)) + 
  geom_violin(aes(fill=ER_status))
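
The boxplots can be backed up with a few numbers. The sketch below, on an invented table (`toy` is hypothetical), summarises GATA3 per ER group with `group_by()` and `summarise()`. It filters on GATA3 alone, whereas `na.omit()` on a whole data frame drops a row if any column is missing:

```r
library(dplyr)

# Hypothetical expression values, one of them missing
toy <- data.frame(
  ER_status = c("Positive", "Positive", "Positive", "Negative", "Negative", "Negative"),
  GATA3     = c(10.2, 9.8, NA, 6.9, 7.4, 7.1)
)

gata3_summary <- toy %>%
  filter(!is.na(GATA3)) %>%        # drop only the rows that lack GATA3
  group_by(ER_status) %>%
  summarise(n = n(),
            median_GATA3 = median(GATA3))
gata3_summary
```

The median per group corresponds to the thick horizontal line inside each box.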

Now, we will look at the distribution of age of the patients at diagnosis as a function of some selected mutated genes.

mut_gene <- mutation_data %>%
  filter(Consequence != "synonymous_variant") %>%
  select(gene=Hugo_Symbol,patient_ID=Tumor_Sample_Barcode )

patient_age <- patient_data %>% select(age=AGE_AT_DIAGNOSIS,patient_ID=PATIENT_ID)

plot_data <- left_join(mut_gene,patient_age,by="patient_ID") %>%
  filter(gene %in% c("PIK3CA", "TP53", "GATA3", "CDH1", "MAP3K1", "CBFB", "SF3B1")) %>%
  # conditions are checked in order and leave no gaps, so non-integer ages (e.g. 54.6) still fall into a band
  mutate(age_cat = case_when(age < 45 ~ "<45",
                             age < 55 ~ "45-54",
                             age < 65 ~ "55-64",
                             age >= 65 ~ ">64")) %>%
  na.omit()

plot_data$age_cat <- factor(plot_data$age_cat, ordered = TRUE, levels = c(">64","55-64","45-54","<45"))

plot_data %>%
  group_by(gene,age_cat) %>%
  select(gene,age_cat) %>% 
  summarise(freq=n()) %>%
  ggplot() +
  geom_col(aes(gene,freq, fill=age_cat), position="fill", colour="black") +
  scale_fill_manual(values=c("#568a48","#6fad76","#aac987","#e6ede3")) +
  theme_classic()
## `summarise()` has grouped output by 'gene'. You can override using the
## `.groups` argument.
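
Age bands like these can also be produced with base R's `cut()`, which assigns every value, including non-integer ages, to exactly one interval. A minimal sketch on made-up ages (the vector `age` below is hypothetical):

```r
# Hypothetical ages at diagnosis, including decimals
age <- c(38.2, 45, 54.6, 55, 64.9, 65, 80.1)

# right = FALSE makes each interval closed on the left: [45, 55), [55, 65), ...
age_cat <- cut(age,
               breaks = c(-Inf, 45, 55, 65, Inf),
               labels = c("<45", "45-54", "55-64", ">64"),
               right  = FALSE)

# Number of patients per band
table(age_cat)
```

`cut()` returns a factor whose levels follow the order of `labels`, so the result can go straight into `fill = ...` without a separate `factor()` call.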

Can we distinguish any pattern from the plot?

Now, we will try to explore patterns of co-occurring mutations and mutual exclusivity in a set of 21 driver genes (so-called Mut-driver genes) -

#install.packages("splitstackshape")
#install.packages("reshape2")
library(splitstackshape)
library(reshape2)

# create a gene-by-patient matrix: 1 if the gene is mutated in that patient, 0 otherwise
# (charMat is an internal, non-exported function of splitstackshape, hence the ':::')
mat <- t(splitstackshape:::charMat(listOfValues = split( mut_gene$gene,mut_gene$patient_ID), fill = 0L))

# set of 21 Mut-driver genes
mat_gene <- c("PIK3CA","AKT1","PTEN","PIK3R1","FOXO3", "RB1", "KMT2C", "ARID1A","NCOR1","CTCF", "TP53", "MAP3K1", "NF1","CDH1","GATA3","TBX3","CBFB","RUNX1","ERBB2","USP9X","SF3B1")

# create an empty matrix 
mat_asso <- matrix(data=NA, nrow = length(mat_gene), ncol = length(mat_gene))
colnames(mat_asso) <- mat_gene
rownames(mat_asso) <- mat_gene

# fill in the cells with log odds ratio for each pairwise association test
for(i in mat_gene){
  for(j in mat_gene){
    mat_asso[i,j] <- fisher.test(mat[i,],mat[j,])$estimate %>% log()
  }

}

# blank out the upper triangle and the diagonal: each pair is tested twice,
# and a gene tested against itself has no meaningful odds ratio
mat_asso[upper.tri(mat_asso, diag = T)] <- 0


ggplot(melt(mat_asso), aes(Var1,Var2)) +
  geom_tile(aes(fill=value), colour="white") +
  scale_fill_gradient2(low = "#7c4d91", high = "#5e8761",mid = "white", limits = c(-2,2)) +
  labs(title = "Patterns of association between somatic events",
       caption = "Purple squares represent negative associations (mutually exclusive mutations).\nGreen squares represent positively associated events (co-mutation).\nThe colour scale represents the magnitude of the association (log odds)",
       x="",
       y="",
       fill= "Log odds")+
  theme_classic() +
  coord_flip() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank(),
        axis.line.x = element_blank(),
        axis.line.y = element_blank())
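
Each tile of the heatmap comes from one Fisher's exact test on a pair of 0/1 mutation vectors. The sketch below uses invented mutation calls (`gene_a` and `gene_b` are hypothetical) to show what a single test returns:

```r
# Invented 0/1 mutation calls for two genes across 14 patients
gene_a <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
gene_b <- c(1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0)

# 2 x 2 contingency table: rows = gene_a status, columns = gene_b status
tab <- table(gene_a, gene_b)
tab

test <- fisher.test(tab)

# Log of the estimated odds ratio: the quantity stored in each cell of mat_asso
log_odds <- log(unname(test$estimate))
log_odds       # below 0: the two genes tend not to be mutated together
test$p.value   # the heatmap colours effect size only, not significance
```

A log odds ratio above 0 would point towards co-mutation instead. Because only the effect size is plotted, a strongly coloured tile is not necessarily a statistically significant one.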

References

https://r4ds.had.co.nz/data-visualisation.html

https://ggplot2.tidyverse.org/

https://r4ds.had.co.nz/graphics-for-communication.html

http://r-statistics.co/ggplot2-Tutorial-With-R.html

http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html

https://beanumber.github.io/sds192/lab-ggplot2.html